FOUNDATIONS OF THE THEORY OF PROBABILITY

BY A. N. KOLMOGOROV

TRANSLATION EDITED BY NATHAN MORRISON

CHELSEA PUBLISHING COMPANY
NEW YORK
1950

COPYRIGHT 1950 BY CHELSEA PUBLISHING COMPANY
PRINTED IN U.S.A.

EDITOR'S NOTE

In the preparation of this English translation of Professor Kolmogorov's fundamental work, the original German monograph Grundbegriffe der Wahrscheinlichkeitsrechnung, which appeared in the Ergebnisse der Mathematik in 1933, and also a Russian translation by G. M. Bavli published in 1936 have been used.

It is a pleasure to acknowledge the invaluable assistance of two friends and former colleagues, Mrs. Ida Rhodes and Mr. D. V. Varley, and also of my niece, Gizella Gross. Thanks are also due to Mr. Roy Kuebler, who made available for comparison purposes his independent English translation of the original German monograph.

Nathan Morrison

PREFACE

The purpose of this monograph is to give an axiomatic foundation for the theory of probability. The author set himself the task of putting in their natural place, among the general notions of modern mathematics, the basic concepts of probability theory, concepts which until recently were considered to be quite peculiar.

This task would have been a rather hopeless one before the introduction of Lebesgue's theories of measure and integration. However, after Lebesgue's publication of his investigations, the analogies between measure of a set and probability of an event, and between integral of a function and mathematical expectation of a random variable, became apparent. These analogies allowed of further extensions; thus, for example, various properties of independent random variables were seen to be in complete analogy with the corresponding properties of orthogonal functions.
But if probability theory was to be based on the above analogies, it still was necessary to make the theories of measure and integration independent of the geometric elements which were in the foreground with Lebesgue. This has been done by Fréchet.

While a conception of probability theory based on the above general viewpoints has been current for some time among certain mathematicians, there was lacking a complete exposition of the whole system, free of extraneous complications. (Cf., however, the book by Fréchet, [2] in the bibliography.)

I wish to call attention to those points of the present exposition which are outside the above-mentioned range of ideas familiar to the specialist. They are the following: probability distributions in infinite-dimensional spaces (Chapter III, § 4); differentiation and integration of mathematical expectations with respect to a parameter (Chapter IV, § 5); and especially the theory of conditional probabilities and conditional expectations (Chapter V). It should be emphasized that these new problems arose, of necessity, from some perfectly concrete physical problems.¹

¹ Cf., e.g., the paper by M. Leontovich quoted in footnote 6 on p. 46; also the joint paper by the author and M. Leontovich, Zur Statistik der kontinuierlichen Systeme und des zeitlichen Verlaufes der physikalischen Vorgänge. Phys. Jour. of the USSR, Vol. 3, 1933, pp. 35-63.

The sixth chapter contains a survey, without proofs, of some results of A. Khinchine and the author concerning the limitations on the applicability of the ordinary and of the strong law of large numbers. The bibliography contains some recent works which should be of interest from the point of view of the foundations of the subject.

I wish to express my warm thanks to Mr. Khinchine, who has read carefully the whole manuscript and proposed several improvements.

Kljasma near Moscow, Easter 1933.
A. Kolmogorov

CONTENTS

Editor's Note
Preface

I. Elementary Theory of Probability
  § 1. Axioms
  § 2. The relation to experimental data
  § 3. Notes on terminology
  § 4. Immediate corollaries of the axioms; conditional probabilities; Theorem of Bayes
  § 5. Independence
  § 6. Conditional probabilities as random variables; Markov chains

II. Infinite Probability Fields
  § 1. Axiom of Continuity
  § 2. Borel fields of probability
  § 3. Examples of infinite fields of probability

III. Random Variables
  § 1. Probability functions
  § 2. Definition of random variables and of distribution functions
  § 3. Multi-dimensional distribution functions
  § 4. Probabilities in infinite-dimensional spaces
  § 5. Equivalent random variables; various kinds of convergence

IV. Mathematical Expectations
  § 1. Abstract Lebesgue integrals
  § 2. Absolute and conditional mathematical expectations
  § 3. The Tchebycheff inequality
  § 4. Some criteria for convergence
  § 5. Differentiation and integration of mathematical expectations with respect to a parameter

V. Conditional Probabilities and Mathematical Expectations
  § 1. Conditional probabilities
  § 2. Explanation of a Borel paradox
  § 3. Conditional probabilities with respect to a random variable
  § 4. Conditional mathematical expectations

VI. Independence; The Law of Large Numbers
  § 1. Independence
  § 2. Independent random variables
  § 3. The Law of Large Numbers
  § 4. Notes on the concept of mathematical expectation
  § 5. The Strong Law of Large Numbers; convergence of a series

Appendix: Zero-or-one law in the theory of probability
Bibliography

Chapter I

ELEMENTARY THEORY OF PROBABILITY

We define as elementary theory of probability that part of the theory in which we have to deal with probabilities of only a finite number of events. The theorems which we derive here can be applied also to the problems connected with an infinite number of random events.
However, when the latter are studied, essentially new principles are used. Therefore the only axiom of the mathematical theory of probability which deals particularly with the case of an infinite number of random events is not introduced until the beginning of Chapter II (Axiom VI).

The theory of probability, as a mathematical discipline, can and should be developed from axioms in exactly the same way as Geometry and Algebra. This means that after we have defined the elements to be studied and their basic relations, and have stated the axioms by which these relations are to be governed, all further exposition must be based exclusively on these axioms, independent of the usual concrete meaning of these elements and their relations.

In accordance with the above, in § 1 the concept of a field of probabilities is defined as a system of sets which satisfies certain conditions. What the elements of this set represent is of no importance in the purely mathematical development of the theory of probability (cf. the introduction of basic geometric concepts in the Foundations of Geometry by Hilbert, or the definitions of groups, rings and fields in abstract algebra).

Every axiomatic (abstract) theory admits, as is well known, of an unlimited number of concrete interpretations besides those from which it was derived. Thus we find applications in fields of science which have no relation to the concepts of random event and of probability in the precise meaning of these words.

The postulational basis of the theory of probability can be established by different methods in respect to the selection of axioms as well as in the selection of basic concepts and relations. However, if our aim is to achieve the utmost simplicity both in the system of axioms and in the further development of the theory, then the postulational concepts of a random event and its probability seem the most suitable.
There are other postulational systems of the theory of probability, particularly those in which the concept of probability is not treated as one of the basic concepts, but is itself expressed by means of other concepts.¹ However, in that case, the aim is different, namely, to tie up as closely as possible the mathematical theory with the empirical development of the theory of probability.

§ 1. Axioms²

Let $E$ be a collection of elements $\xi, \eta, \zeta, \ldots$, which we shall call elementary events, and $\mathfrak{F}$ a set of subsets of $E$; the elements of the set $\mathfrak{F}$ will be called random events.

I. $\mathfrak{F}$ is a field³ of sets.
II. $\mathfrak{F}$ contains the set $E$.
III. To each set $A$ in $\mathfrak{F}$ is assigned a non-negative real number $P(A)$. This number $P(A)$ is called the probability of the event $A$.
IV. $P(E)$ equals 1.
V. If $A$ and $B$ have no element in common, then
    $P(A + B) = P(A) + P(B)$.

A system of sets, $\mathfrak{F}$, together with a definite assignment of numbers $P(A)$, satisfying Axioms I-V, is called a field of probability.

Our system of Axioms I-V is consistent. This is proved by the following example. Let $E$ consist of the single element $\xi$ and let $\mathfrak{F}$ consist of $E$ and the null set $0$. $P(E)$ is then set equal to 1 and $P(0)$ equals 0.

¹ For example, R. von Mises [1] and [2] and S. Bernstein [1].
² The reader who wishes from the outset to give a concrete meaning to the following axioms is referred to § 2.
³ Cf. Hausdorff, Mengenlehre, 1927, p. 78. A system of sets is called a field if the sum, product, and difference of two sets of the system also belong to the same system. Every non-empty field contains the null set $0$. Using Hausdorff's notation, we designate the product of $A$ and $B$ by $AB$; the sum by $A + B$ in the case where $AB = 0$, and in the general case by $A \dotplus B$; the difference of $A$ and $B$ by $A - B$. The set $E - A$, which is the complement of $A$, will be denoted by $\bar{A}$. We shall assume that the reader is familiar with the fundamental rules of operations of sets and their sums, products, and differences.
All subsets of $\mathfrak{F}$ will be designated by Latin capitals.

Our system of axioms is not, however, complete, for in various problems in the theory of probability different fields of probability have to be examined.

The Construction of Fields of Probability. The simplest fields of probability are constructed as follows. We take an arbitrary finite set $E = \{\xi_1, \xi_2, \ldots, \xi_k\}$ and an arbitrary set $\{p_1, p_2, \ldots, p_k\}$ of non-negative numbers with the sum $p_1 + p_2 + \ldots + p_k = 1$. $\mathfrak{F}$ is taken as the set of all subsets in $E$, and we put

    $P\{\xi_{i_1}, \xi_{i_2}, \ldots, \xi_{i_\lambda}\} = p_{i_1} + p_{i_2} + \ldots + p_{i_\lambda}$.

In such cases, $p_1, p_2, \ldots, p_k$ are called the probabilities of the elementary events $\xi_1, \xi_2, \ldots, \xi_k$ or simply elementary probabilities. In this way are derived all possible finite fields of probability in which $\mathfrak{F}$ consists of the set of all subsets of $E$. (The field of probability is called finite if the set $E$ is finite.) For further examples see Chap. II, § 3.

§ 2. The Relation to Experimental Data⁴

⁴ The reader who is interested in the purely mathematical development of the theory only, need not read this section, since the work following it is based only upon the axioms in § 1 and makes no use of the present discussion. Here we limit ourselves to a simple explanation of how the axioms of the theory of probability arose and disregard the deep philosophical dissertations on the concept of probability in the experimental world. In establishing the premises necessary for the applicability of the theory of probability to the world of actual events, the author has used, in large measure, the work of R. v. Mises, [1] pp. 21-27.

We apply the theory of probability to the actual world of experiments in the following manner:

1) There is assumed a complex of conditions, $\mathfrak{S}$, which allows of any number of repetitions.

2) We study a definite set of events which could take place as a result of the establishment of the conditions $\mathfrak{S}$. In individual cases where the conditions are realized, the events occur, generally, in different ways. Let $E$ be the set of all possible variants $\xi_1, \xi_2, \ldots$ of the outcome of the given events. Some of these variants might in general not occur. We include in set $E$ all the variants which we regard a priori as possible.

3) If the variant of the events which has actually occurred upon realization of conditions $\mathfrak{S}$ belongs to the set $A$ (defined in any way), then we say that the event $A$ has taken place.

Example: Let the complex $\mathfrak{S}$ of conditions be the tossing of a coin two times. The set of events mentioned in Paragraph 2) consists of the fact that at each toss either a head or a tail may come up. From this it follows that only four different variants (elementary events) are possible, namely: HH, HT, TH, TT. If the "event $A$" connotes the occurrence of a repetition, then it will consist of a happening of either the first or the fourth of the four elementary events. In this manner, every event may be regarded as a set of elementary events.

4) Under certain conditions, which we shall not discuss here, we may assume that to an event $A$ which may or may not occur under conditions $\mathfrak{S}$, is assigned a real number $P(A)$ which has the following characteristics:

(a) One can be practically certain that if the complex of conditions $\mathfrak{S}$ is repeated a large number of times, $n$, then if $m$ be the number of occurrences of event $A$, the ratio $m/n$ will differ very slightly from $P(A)$.

(b) If $P(A)$ is very small, one can be practically certain that when conditions $\mathfrak{S}$ are realized only once, the event $A$ would not occur at all.

The Empirical Deduction of the Axioms. In general, one may assume that the system $\mathfrak{F}$ of the observed events $A, B, C, \ldots$ to which are assigned definite probabilities, form a field containing as an element the set $E$ (Axioms I, II, and the first part of III, postulating the existence of probabilities).
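The construction of a finite field of probability and the frequency principle (a) can be made concrete with the two-toss example above. What follows is only a sketch under stated assumptions: the coin is taken to be fair (elementary probabilities 1/4 each), and the names `P`, `all_events` and `relative_frequency` are illustrative helpers, not notation from the text.

```python
import random
from fractions import Fraction
from itertools import chain, combinations

# Elementary events for two tosses of a coin.
E = ("HH", "HT", "TH", "TT")
# Assumption for this sketch: a fair coin, so each elementary probability is 1/4.
p = {xi: Fraction(1, 4) for xi in E}


def P(A):
    """Probability of an event A (a set of elementary events):
    the sum of the elementary probabilities of its members."""
    return sum((p[xi] for xi in A), Fraction(0))


def all_events(E):
    """The field of all subsets of E."""
    subsets = chain.from_iterable(combinations(E, r) for r in range(len(E) + 1))
    return [frozenset(s) for s in subsets]


# The event "a repetition occurs": first or fourth elementary event.
A = frozenset({"HH", "TT"})


def relative_frequency(A, n, seed=0):
    """Repeat the two-toss experiment n times and return m/n,
    where m counts the occurrences of the event A."""
    rng = random.Random(seed)
    m = sum(1 for _ in range(n) if rng.choice("HT") + rng.choice("HT") in A)
    return m / n


if __name__ == "__main__":
    print(P(A))                            # exact probability of a repetition
    print(relative_frequency(A, 100_000))  # observed ratio m/n, close to P(A)
```

With exact fractions the additivity of `P` over disjoint events can be checked over the whole field, while the simulated ratio m/n only approximates P(A), in line with the "practically certain" wording of (a).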
It is clear that $0 \le m/n \le 1$, so that the second part of Axiom III is quite natural. For the event $E$, $m$ is always equal to $n$, so that it is natural to postulate $P(E) = 1$ (Axiom IV). If, finally, $A$ and $B$ are non-intersecting (incompatible), then $m = m_1 + m_2$, where $m, m_1, m_2$ are respectively the number of experiments in which the events $A + B$, $A$, and $B$ occur. From this it follows that

    $\dfrac{m}{n} = \dfrac{m_1}{n} + \dfrac{m_2}{n}$.

It therefore seems appropriate to postulate that $P(A + B) = P(A) + P(B)$ (Axiom V).

Remark 1. If two separate statements are each practically reliable, then we may say that simultaneously they are both reliable, although the degree of reliability is somewhat lowered in the process. If, however, the number of such statements is very large, then from the practical reliability of each, one cannot deduce anything about the simultaneous correctness of all of them. Therefore from the principle stated in (a) it does not follow that in a very large number of series of $n$ tests each, in each the ratio $m/n$ will differ only slightly from $P(A)$.

Remark 2. To an impossible event (an empty set) corresponds, in accordance with our axioms, the probability $P(0) = 0$⁵, but the converse is not true: $P(A) = 0$ does not imply the impossibility of $A$. When $P(A) = 0$, from principle (b) all we can assert is that when the conditions $\mathfrak{S}$ are realized but once, event $A$ is practically impossible. It does not at all assert, however, that in a sufficiently long series of tests the event $A$ will not occur. On the other hand, one can deduce from the principle (a) merely that when $P(A) = 0$ and $n$ is very large, the ratio $m/n$ will be very small (it might, for example, be equal to $1/n$).

⁵ Cf. § 4, Formula (3).

§ 3. Notes on Terminology

We have defined the objects of our future study, random events, as sets. However, in the theory of probability many set-theoretic concepts are designated by other terms. We shall give here a brief list of such concepts.

    Theory of Sets / Random Events

1. Theory of sets: $A$ and $B$ do not intersect, i.e., $AB = 0$.
   Random events: Events $A$ and $B$ are incompatible.
2. Theory of sets: $AB \ldots N = 0$.
   Random events: Events $A, B, \ldots, N$ are incompatible.
3. Theory of sets: $AB \ldots N = X$.
   Random events: Event $X$ is defined as the simultaneous occurrence of events $A, B, \ldots, N$.
4. Theory of sets: $A \dotplus B \dotplus \ldots \dotplus N = X$.
   Random events: Event $X$ is defined as the occurrence of at least one of the events $A, B, \ldots, N$.
5. Theory of sets: The complementary set $\bar{A}$.
   Random events: The opposite event $\bar{A}$ consisting of the non-occurrence of event $A$.
6. Theory of sets: $A = 0$.
   Random events: Event $A$ is impossible.
7. Theory of sets: $A = E$.
   Random events: Event $A$ must occur.
8. Theory of sets: The system $\mathfrak{A}$ of the sets $A_1, A_2, \ldots, A_n$ forms a decomposition of the set $E$ if $A_1 + A_2 + \ldots + A_n = E$. (This assumes that the sets $A_i$ do not intersect, in pairs.)
   Random events: Experiment $\mathfrak{A}$ consists of determining which of the events $A_1, A_2, \ldots, A_n$ occurs. We therefore call $A_1, A_2, \ldots, A_n$ the possible results of experiment $\mathfrak{A}$.
9. Theory of sets: $B$ is a subset of $A$: $B \subset A$.
   Random events: From the occurrence of event $B$ follows the inevitable occurrence of $A$.

§ 4. Immediate Corollaries of the Axioms; Conditional Probabilities; Theorem of Bayes

From $A + \bar{A} = E$ and the Axioms IV and V it follows that

    $P(A) + P(\bar{A}) = 1$,    (1)
    $P(\bar{A}) = 1 - P(A)$.    (2)

Since $\bar{E} = 0$, then, in particular,

    $P(0) = 0$.    (3)

If $A, B, \ldots, N$ are incompatible, then from Axiom V follows the formula (the Addition Theorem)

    $P(A + B + \ldots + N) = P(A) + P(B) + \ldots + P(N)$.    (4)

If $P(A) > 0$, then the quotient

    $P_A(B) = \dfrac{P(AB)}{P(A)}$    (5)

is defined to be the conditional probability of the event $B$ under the condition $A$. From (5) it follows immediately that

    $P(AB) = P(A) P_A(B)$.    (6)

And by induction we obtain the general formula (the Multiplication Theorem)

    $P(A_1 A_2 \ldots A_n) = P(A_1) P_{A_1}(A_2) P_{A_1 A_2}(A_3) \ldots P_{A_1 A_2 \ldots A_{n-1}}(A_n)$.    (7)

The following theorems follow easily:

    $P_A(B) \ge 0$,    (8)
    $P_A(E) = 1$,    (9)
    $P_A(B + C) = P_A(B) + P_A(C)$.
    (10)

Comparing formulae (8)-(10) with Axioms III-V, we find that the system $\mathfrak{F}$ of sets together with the set function $P_A(B)$ (provided $A$ is a fixed set) form a field of probability, and therefore all the above general theorems concerning $P(B)$ hold true for the conditional probability $P_A(B)$ (provided the event $A$ is fixed).

It is also easy to see that

    $P_A(A) = 1$.    (11)

From (6) and the analogous formula

    $P(AB) = P(B) P_B(A)$

we obtain the important formula:

    $P_B(A) = \dfrac{P(A) P_A(B)}{P(B)}$,    (12)

which contains, in essence, the Theorem of Bayes.

The Theorem on Total Probability: Let $A_1 + A_2 + \ldots + A_n = E$ (this assumes that the events $A_1, A_2, \ldots, A_n$ are mutually exclusive) and let $X$ be arbitrary. Then

    $P(X) = P(A_1) P_{A_1}(X) + P(A_2) P_{A_2}(X) + \ldots + P(A_n) P_{A_n}(X)$.    (13)

Proof:

    $X = A_1 X + A_2 X + \ldots + A_n X$;

using (4) we have

    $P(X) = P(A_1 X) + P(A_2 X) + \ldots + P(A_n X)$,

and according to (6) we have at the same time

    $P(A_i X) = P(A_i) P_{A_i}(X)$.

The Theorem of Bayes: Let $A_1 + A_2 + \ldots + A_n = E$ and $X$ be arbitrary, then

    $P_X(A_i) = \dfrac{P(A_i) P_{A_i}(X)}{P(A_1) P_{A_1}(X) + P(A_2) P_{A_2}(X) + \ldots + P(A_n) P_{A_n}(X)}$,    $i = 1, 2, 3, \ldots, n$.    (14)

$A_1, A_2, \ldots, A_n$ are often called "hypotheses", and formula (14) is considered as the probability $P_X(A_i)$ of the hypothesis $A_i$ after the occurrence of event $X$. [$P(A_i)$ then denotes the a priori probability of $A_i$.]

Proof: From (12) we have

    $P_X(A_i) = \dfrac{P(A_i) P_{A_i}(X)}{P(X)}$.

To obtain the formula (14) it only remains to substitute for the probability $P(X)$ its value derived from (13) by applying the theorem on total probability.

§ 5. Independence

The concept of mutual independence of two or more experiments holds, in a certain sense, a central position in the theory of probability. Indeed, as we have already seen, the theory of probability can be regarded from the mathematical point of view as a special application of the general theory of additive set functions.
One naturally asks, how did it happen that the theory of probability developed into a large individual science possessing its own methods?

In order to answer this question, we must point out the specialization undergone by general problems in the theory of additive set functions when they are proposed in the theory of probability.

The fact that our additive set function $P(A)$ is non-negative and satisfies the condition $P(E) = 1$ does not in itself cause new difficulties. Random variables (see Chap. III) from a mathematical point of view represent merely functions measurable with respect to $P(A)$, while their mathematical expectations are abstract Lebesgue integrals. (This analogy was explained fully for the first time in the work of Fréchet⁶.) The mere introduction of the above concepts, therefore, would not be sufficient to produce a basis for the development of a large new theory.

⁶ See Fréchet [1] and [2].

Historically, the independence of experiments and random variables represents the very mathematical concept that has given the theory of probability its peculiar stamp. The classical work of Laplace, Poisson, Tchebychev, Markov, Liapounov, Mises, and Bernstein is actually dedicated to the fundamental investigation of series of independent random variables. Though the latest dissertations (Markov, Bernstein and others) frequently fail to assume complete independence, they nevertheless reveal the necessity of introducing analogous, weaker, conditions, in order to obtain sufficiently significant results (see in this chapter § 6, Markov chains).

We thus see, in the concept of independence, at least the germ of the peculiar type of problem in probability theory. In this book, however, we shall not stress that fact, for here we are interested mainly in the logical foundation for the specialized investigations of the theory of probability.
In consequence, one of the most important problems in the philosophy of the natural sciences is (in addition to the well-known one regarding the essence of the concept of probability itself) to make precise the premises which would make it possible to regard any given real events as independent. This question, however, is beyond the scope of this book.

Let us turn to the definition of independence. Given $n$ experiments $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}$, that is, $n$ decompositions

    $E = A_1^{(i)} + A_2^{(i)} + \ldots + A_{r_i}^{(i)}$,    $i = 1, 2, \ldots, n$

of the basic set $E$. It is then possible to assign $r = r_1 r_2 \ldots r_n$ probabilities (in the general case)

    $p_{q_1 q_2 \ldots q_n} = P(A_{q_1}^{(1)} A_{q_2}^{(2)} \ldots A_{q_n}^{(n)}) \ge 0$

which are entirely arbitrary except for the single condition⁷ that

    $\sum_{q_1, q_2, \ldots, q_n} p_{q_1 q_2 \ldots q_n} = 1$.    (1)

Definition I. $n$ experiments $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}$ are called mutually independent, if for any $q_1, q_2, \ldots, q_n$ the following equation holds true:

    $P(A_{q_1}^{(1)} A_{q_2}^{(2)} \ldots A_{q_n}^{(n)}) = P(A_{q_1}^{(1)}) P(A_{q_2}^{(2)}) \ldots P(A_{q_n}^{(n)})$.    (2)

Among the $r$ equations in (2), there are only $r - r_1 - r_2 - \ldots - r_n + n - 1$ independent equations⁸.

Theorem I. If $n$ experiments $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}$ are mutually independent, then any $m$ of them ($m < n$), $\mathfrak{A}^{(i_1)}, \mathfrak{A}^{(i_2)}, \ldots, \mathfrak{A}^{(i_m)}$, are also independent⁹.

In the case of independence we then have the equations:

    $P(A_{q_1}^{(i_1)} A_{q_2}^{(i_2)} \ldots A_{q_m}^{(i_m)}) = P(A_{q_1}^{(i_1)}) P(A_{q_2}^{(i_2)}) \ldots P(A_{q_m}^{(i_m)})$    (3)

(all $i_k$ must be different).

Definition II. $n$ events $A_1, A_2, \ldots, A_n$ are mutually independent, if the decompositions (trials)

    $E = A_k + \bar{A}_k$    $(k = 1, 2, \ldots, n)$

are independent.

In this case $r_1 = r_2 = \ldots = r_n = 2$, $r = 2^n$; therefore, of the $2^n$ equations in (2) only $2^n - n - 1$ are independent. The necessary and sufficient conditions for the independence of the events $A_1, A_2, \ldots, A_n$ are the following $2^n - n - 1$ equations¹⁰:

    $P(A_{i_1} A_{i_2} \ldots A_{i_m}) = P(A_{i_1}) P(A_{i_2}) \ldots P(A_{i_m})$,    (4)
    $m = 1, 2, \ldots, n$,    $1 \le i_1 < i_2 < \ldots < i_m \le n$.

All of these equations are mutually independent.

⁷ One may construct a field of probability with arbitrary probabilities subject only to the above-mentioned conditions, as follows: $E$ is composed of $r$ elements $\xi_{q_1 q_2 \ldots q_n}$. Let the corresponding elementary probabilities be $p_{q_1 q_2 \ldots q_n}$, and finally let $A_q^{(i)}$ be the set of all $\xi_{q_1 q_2 \ldots q_n}$ for which $q_i = q$.
⁸ Actually, in the case of independence, one may choose arbitrarily only $r_1 + r_2 + \ldots + r_n$ probabilities $p_q^{(i)} = P(A_q^{(i)})$ so as to comply with the $n$ conditions $\sum_q p_q^{(i)} = 1$. Therefore, in the general case, we have $r - 1$ degrees of freedom, but in the case of independence only $r_1 + r_2 + \ldots + r_n - n$.
⁹ To prove this it is sufficient to show that from the mutual independence of $n$ decompositions follows the mutual independence of the first $n - 1$. Let us assume that the equations (2) hold. Then
    $P(A_{q_1}^{(1)} \ldots A_{q_{n-1}}^{(n-1)}) = \sum_{q_n} P(A_{q_1}^{(1)} \ldots A_{q_n}^{(n)}) = P(A_{q_1}^{(1)}) \ldots P(A_{q_{n-1}}^{(n-1)}) \sum_{q_n} P(A_{q_n}^{(n)}) = P(A_{q_1}^{(1)}) \ldots P(A_{q_{n-1}}^{(n-1)})$.    Q.E.D.
¹⁰ See S. N. Bernstein [1] pp. 47-57. However, the reader can easily prove this himself (using mathematical induction).

In the case $n = 2$ we obtain from (4) only one condition ($2^2 - 2 - 1 = 1$) for the independence of two events $A_1$ and $A_2$:

    $P(A_1 A_2) = P(A_1) P(A_2)$.    (5)

The system of equations (2) reduces itself, in this case, to three equations, besides (5):

    $P(A_1 \bar{A}_2) = P(A_1) P(\bar{A}_2)$,
    $P(\bar{A}_1 A_2) = P(\bar{A}_1) P(A_2)$,
    $P(\bar{A}_1 \bar{A}_2) = P(\bar{A}_1) P(\bar{A}_2)$,

which obviously follow from (5).¹¹

It need hardly be remarked that from the independence of the events $A_1, A_2, \ldots, A_n$ in pairs, i.e. from the relations

    $P(A_i A_j) = P(A_i) P(A_j)$,    $i \ne j$,

it does not at all follow that when $n > 2$ these events are independent¹². (For that we need the existence of all equations (4).)

In introducing the concept of independence, no use was made of conditional probability. Our aim has been to explain as clearly as possible, in a purely mathematical manner, the meaning of this concept.
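The remark that independence in pairs does not imply mutual independence can be checked by brute force on a four-point field with equal elementary probabilities, the example attributed to S. N. Bernstein in the footnote to this remark. The sketch below is illustrative only: the helper names `P`, `pairwise_independent` and `mutually_independent` are our own, and the second function simply tests equations (4) for every subfamily of two or more events.

```python
from fractions import Fraction
from itertools import combinations

# Four elementary events, each with elementary probability 1/4.
p = {1: Fraction(1, 4), 2: Fraction(1, 4), 3: Fraction(1, 4), 4: Fraction(1, 4)}


def P(A):
    """Probability of a set A of elementary events."""
    return sum((p[xi] for xi in A), Fraction(0))


def pairwise_independent(events):
    """Check P(A_i A_j) = P(A_i) P(A_j) for every pair i != j."""
    return all(P(A & B) == P(A) * P(B) for A, B in combinations(events, 2))


def mutually_independent(events):
    """Check equations (4): for every subfamily of m >= 2 events,
    the probability of the product equals the product of the probabilities."""
    for m in range(2, len(events) + 1):
        for family in combinations(events, m):
            product_of_probs = Fraction(1)
            for A in family:
                product_of_probs *= P(A)
            if P(frozenset.intersection(*family)) != product_of_probs:
                return False
    return True


A = frozenset({1, 2})
B = frozenset({1, 3})
C = frozenset({1, 4})

if __name__ == "__main__":
    print(pairwise_independent([A, B, C]))   # each pair satisfies the product rule
    print(mutually_independent([A, B, C]))   # the triple product rule fails
    print(P(A & B & C), P(A) * P(B) * P(C))  # 1/4 versus 1/8
```

For n = 3 the loop in `mutually_independent` runs over exactly 2^3 - 3 - 1 = 4 subfamilies, matching the count of non-trivial equations in (4).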
Its applications, however, generally depend upon the properties of certain conditional probabilities.

If we assume that all probabilities $P(A_q^{(i)})$ are positive, then from the equations (3) it follows¹³ that

    $P_{A_{q_1}^{(i_1)} A_{q_2}^{(i_2)} \ldots A_{q_{m-1}}^{(i_{m-1})}}(A_{q_m}^{(i_m)}) = P(A_{q_m}^{(i_m)})$.    (6)

From the fact that formulas (6) hold, and from the Multiplication Theorem (Formula (7), § 4), follow the formulas (2). We obtain, therefore,

Theorem II: A necessary and sufficient condition for independence of experiments $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}$ in the case of positive probabilities $P(A_q^{(i)})$ is that the conditional probability of the results $A_q^{(i)}$ of experiments $\mathfrak{A}^{(i)}$ under the hypothesis that several other tests $\mathfrak{A}^{(i_1)}, \mathfrak{A}^{(i_2)}, \ldots, \mathfrak{A}^{(i_k)}$ have had definite results $A_{q_1}^{(i_1)}, A_{q_2}^{(i_2)}, \ldots, A_{q_k}^{(i_k)}$ is equal to the absolute probability $P(A_q^{(i)})$.

¹¹ $P(A_1 \bar{A}_2) = P(A_1) - P(A_1 A_2) = P(A_1) - P(A_1) P(A_2) = P(A_1)\{1 - P(A_2)\} = P(A_1) P(\bar{A}_2)$, etc.
¹² This can be shown by the following simple example (S. N. Bernstein): Let set $E$ be composed of four elements $\xi_1, \xi_2, \xi_3, \xi_4$; the corresponding elementary probabilities $p_1, p_2, p_3, p_4$ are each assumed to be $1/4$, and
    $A = \{\xi_1, \xi_2\}$,  $B = \{\xi_1, \xi_3\}$,  $C = \{\xi_1, \xi_4\}$.
It is easy to compute that
    $P(A) = P(B) = P(C) = 1/2$,
    $P(AB) = P(BC) = P(AC) = 1/4 = (1/2)^2$,
    $P(ABC) = 1/4 \ne (1/2)^3$.
¹³ To prove it, one must keep in mind the definition of conditional probability (Formula (5), § 4) and substitute for the probabilities of products the products of probabilities according to formula (3).

On the basis of formulas (4) we can prove in an analogous manner the following theorem:

Theorem III. If all probabilities $P(A_k)$ are positive, then a necessary and sufficient condition for mutual independence of the events $A_1, A_2, \ldots, A_n$ is the satisfaction of the equations

    $P_{A_{i_1} A_{i_2} \ldots A_{i_k}}(A_i) = P(A_i)$    (7)

for any pairwise different $i_1, i_2, \ldots, i_k, i$.

In the case $n = 2$ the conditions (7) reduce to two equations:

    $P_{A_1}(A_2) = P(A_2)$,
    $P_{A_2}(A_1) = P(A_1)$.    (8)
It is easy to see that the first equation in (8) alone is a necessary and sufficient condition for the independence of $A_1$ and $A_2$ provided $P(A_1) > 0$.

§ 6. Conditional Probabilities as Random Variables, Markov Chains

Let $\mathfrak{A}$ be a decomposition of the fundamental set $E$:

    $E = A_1 + A_2 + \ldots + A_r$,

and $x$ a real function of the elementary event $\xi$, which for every set $A_q$ is equal to a corresponding constant $a_q$. $x$ is then called a random variable, and the sum

    $E(x) = \sum_q a_q P(A_q)$

is called the mathematical expectation of the variable $x$. The theory of random variables will be developed in Chaps. III and IV. We shall not limit ourselves there merely to those random variables which can assume only a finite number of different values.

A random variable which for every set $A_q$ assumes the value $P_{A_q}(B)$, we shall call the conditional probability of the event $B$ after the given experiment $\mathfrak{A}$ and shall designate it by $P_{\mathfrak{A}}(B)$. Two experiments $\mathfrak{A}^{(1)}$ and $\mathfrak{A}^{(2)}$ are independent if, and only if,

    $P_{\mathfrak{A}^{(1)}}(A_q^{(2)}) = P(A_q^{(2)})$,    $q = 1, 2, \ldots, r_2$.

Given any decompositions (experiments) $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}$, we shall represent by

    $\mathfrak{A}^{(1)} \mathfrak{A}^{(2)} \ldots \mathfrak{A}^{(n)}$

the decomposition of set $E$ into the products

    $A_{q_1}^{(1)} A_{q_2}^{(2)} \ldots A_{q_n}^{(n)}$.

Experiments $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}$ are mutually independent when and only when

    $P_{\mathfrak{A}^{(1)} \mathfrak{A}^{(2)} \ldots \mathfrak{A}^{(k-1)}}(A_q^{(k)}) = P(A_q^{(k)})$,

$k$ and $q$ being arbitrary¹⁴.

Definition: The sequence $\mathfrak{A}^{(1)}, \mathfrak{A}^{(2)}, \ldots, \mathfrak{A}^{(n)}, \ldots$ forms a Markov chain if for arbitrary $n$ and $q$

    $P_{\mathfrak{A}^{(1)} \mathfrak{A}^{(2)} \ldots \mathfrak{A}^{(n-1)}}(A_q^{(n)}) = P_{\mathfrak{A}^{(n-1)}}(A_q^{(n)})$.

Thus, Markov chains form a natural generalization of sequences of mutually independent experiments. If we set

    $p_{q_m q_n}(m, n) = P_{A_{q_m}^{(m)}}(A_{q_n}^{(n)})$,    $m < n$,

then the basic formula of the theory of Markov chains will assume the form:

    $p_{q_k q_n}(k, n) = \sum_{q_m} p_{q_k q_m}(k, m)\, p_{q_m q_n}(m, n)$,    $k < m < n$.    (1)

If we denote the matrix $\| p_{q_m q_n}(m, n) \|$ by $p(m, n)$, (1) can be written as¹⁵:

    $p(k, n) = p(k, m)\, p(m, n)$,    $k < m < n$.    (2)

¹⁴ The necessity of these conditions follows from Theorem II, § 5; that they are also sufficient follows immediately from the Multiplication Theorem (Formula (7) of § 4).
¹⁵ For further development of the theory of Markov chains, see R. v. Mises [1], § 16, and B. Hostinsky, Méthodes générales du calcul des probabilités, "Mém. Sci. Math." V. 52, Paris 1931.

Chapter II

INFINITE PROBABILITY FIELDS

§ 1. Axiom of Continuity

We denote by $\mathfrak{D}_m A_m$, as is customary, the product of the sets $A_m$ (whether finite or infinite in number) and their sum by $\mathfrak{S}_m A_m$. Only in the case of disjoint sets $A_m$ is the form $\sum_m A_m$ used instead of $\mathfrak{S}_m A_m$. Consequently,

    $\mathfrak{S}_m A_m = A_1 \dotplus A_2 \dotplus \ldots$,
    $\sum_m A_m = A_1 + A_2 + \ldots$,
    $\mathfrak{D}_m A_m = A_1 A_2 \ldots$.

In all future investigations, we shall assume that besides Axioms I-V, still another holds true:

VI. For a decreasing sequence of events

    $A_1 \supset A_2 \supset \ldots \supset A_n \supset \ldots$    (1)

of $\mathfrak{F}$, for which

    $\mathfrak{D}_n A_n = 0$,    (2)

the following equation holds:

    $\lim P(A_n) = 0$,    $n \to \infty$.    (3)

In the future we shall designate by probability field only a field of probability as outlined in the first chapter, which also satisfies Axiom VI. The fields of probability as defined in the first chapter without Axiom VI might be called generalized fields of probability.

If the system $\mathfrak{F}$ of sets is finite, Axiom VI follows from Axioms I-V. For actually, in that case there exist only a finite number of different sets in the sequence (1). Let $A_k$ be the smallest among them; then all sets $A_{k+p}$ coincide with $A_k$ and we obtain then

    $A_k = A_{k+p} = \mathfrak{D}_n A_n = 0$,
    $\lim P(A_n) = P(0) = 0$.

All examples of finite fields of probability, in the first chapter, satisfy, therefore, Axiom VI. The system of Axioms I-VI then proves to be consistent and incomplete.

For infinite fields, on the other hand, the Axiom of Continuity, VI, proved to be independent of Axioms I-V.
Since the new axiom is essential for infinite fields of probability only, it is almost impossible to elucidate its empirical meaning, as has been done, for example, in the case of Axioms I-V in § 2 of the first chapter. For, in describing any observable random process we can obtain only finite fields of probability. Infinite fields of probability occur only as idealized models of real random processes. We limit ourselves, arbitrarily, to only those models which satisfy Axiom VI. This limitation has been found expedient in researches of the most diverse sort.

Generalized Addition Theorem: If $A_1, A_2, \ldots, A_n, \ldots$ and $A$ belong to $\mathfrak{F}$, then from

    $A = \sum_n A_n$    (4)

follows the equation

    $P(A) = \sum_n P(A_n)$.    (5)

Proof: Let

    $R_n = \sum_{m > n} A_m$.

Then, obviously,

    $\mathfrak{D}_n R_n = 0$,

and, therefore, according to Axiom VI

    $\lim P(R_n) = 0$,    $n \to \infty$.    (6)

On the other hand, by the addition theorem

    $P(A) = P(A_1) + P(A_2) + \ldots + P(A_n) + P(R_n)$.    (7)

From (6) and (7) we immediately obtain (5).

We have shown, then, that the probability $P(A)$ is a completely additive set function on $\mathfrak{F}$. Conversely, Axioms V and VI hold true for every completely additive set function defined on any field $\mathfrak{F}$.* We can, therefore, define the concept of a field of probability in the following way: Let $E$ be an arbitrary set, $\mathfrak{F}$ a field of subsets of $E$, containing $E$, and $P(A)$ a non-negative completely additive set function defined on $\mathfrak{F}$; the field $\mathfrak{F}$ together with the set function $P(A)$ forms a field of probability.

A Covering Theorem: If $A, A_1, A_2, \ldots, A_n, \ldots$ belong to $\mathfrak{F}$ and

    $A \subset \mathfrak{S}_n A_n$,    (8)

then

    $P(A) \le \sum_n P(A_n)$.    (9)

Proof:

    $A = A\, \mathfrak{S}_n (A_n) = A A_1 + A(A_2 - A_2 A_1) + A(A_3 - A_3 A_2 - A_3 A_1) + \ldots$,
    $P(A) = P(A A_1) + P\{A(A_2 - A_2 A_1)\} + \ldots \le P(A_1) + P(A_2) + \ldots$.

§ 2. Borel Fields of Probability

The field $\mathfrak{F}$ is called a Borel field, if all countable sums $\sum_n A_n$ of the sets $A_n$ from $\mathfrak{F}$ belong to $\mathfrak{F}$. Borel fields are also called completely additive systems of sets.
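The decomposition of a general sum into pairwise disjoint pieces, used in the proof of the Covering Theorem, can be traced on a small finite example. The sketch below makes its own assumptions: a six-point set with equal elementary probabilities (a hypothetical fair die), finitely many covering sets, and illustrative helper names `P` and `disjoint_parts`.

```python
from fractions import Fraction

# Hypothetical finite field: six equally likely elementary events.
E = frozenset(range(1, 7))
p = {xi: Fraction(1, 6) for xi in E}


def P(A):
    """Probability of a set A of elementary events."""
    return sum((p[xi] for xi in A), Fraction(0))


def disjoint_parts(sets):
    """Return B_1 = A_1, B_2 = A_2 - A_2 A_1, B_3 = A_3 - A_3 A_2 - A_3 A_1, ...
    The B_n are pairwise disjoint and have the same union as the A_n."""
    seen = set()
    parts = []
    for A in sets:
        parts.append(frozenset(A) - seen)  # remove everything already covered
        seen |= A
    return parts


cover = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5})]
A = frozenset({1, 4, 5})  # contained in the union of the covering sets

parts = disjoint_parts(cover)
# P(A) as a sum over the disjoint pieces A B_n (addition theorem) ...
exact = sum((P(A & B) for B in parts), Fraction(0))
# ... and the upper bound P(A_1) + P(A_2) + ... of the Covering Theorem.
bound = sum((P(An) for An in cover), Fraction(0))

if __name__ == "__main__":
    print(parts)
    print(P(A), exact, bound)
```

Since each piece A B_n is contained in A_n, every term of `exact` is at most the corresponding term of `bound`, which is the whole content of the inequality.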
From the formula

$$\mathfrak{S}_n A_n = A_1 + (A_2 - A_2A_1) + (A_3 - A_3A_2 - A_3A_1) + \cdots \tag{1}$$

we can deduce that a Borel field contains also all the sums $\mathfrak{S}_n A_n$ composed of a countable number of sets $A_n$ belonging to it. From the formula

$$\mathfrak{D}_n A_n = E - \mathfrak{S}_n \bar{A}_n \tag{2}$$

the same can be said for the product of sets. A field of probability is a Borel field of probability if the corresponding field $\mathfrak{F}$ is a Borel field. Only in the case of Borel fields of probability do we obtain full freedom of action, without danger of the occurrence of events having no probability. We shall now prove that we may limit ourselves to the investigation of Borel fields of probability. This will follow from the so-called extension theorem, to which we shall now turn.

Given a field of probability $(\mathfrak{F}, P)$. As is known¹, there exists a smallest Borel field $B\mathfrak{F}$ containing $\mathfrak{F}$. And we have the

Extension Theorem: It is always possible to extend a non-negative completely additive set function $P(A)$, defined in $\mathfrak{F}$, to all sets of $B\mathfrak{F}$ without losing either of its properties (non-negativeness and complete additivity) and this can be done in only one way.

* See, for example, O. Nikodym, Sur une généralisation des intégrales de M. J. Radon, Fund. Math. v. 15, 1930, p. 136.
¹ Hausdorff, Mengenlehre, 1927, p. 85.

The extended field $B\mathfrak{F}$ forms with the extended set function $P(A)$ a field of probability $(B\mathfrak{F}, P)$. This field of probability $(B\mathfrak{F}, P)$ we shall call the Borel extension of the field $(\mathfrak{F}, P)$.

The proof of this theorem, which belongs to the theory of additive set functions and which sometimes appears in other forms, can be given as follows: Let $A$ be any subset of $E$; we shall denote by $P^*(A)$ the lower limit (infimum) of the sums

$$\sum_n P(A_n)$$

for all coverings

$$A \subset \mathfrak{S}_n A_n$$

of the set $A$ by a finite or countable number of sets $A_n$ of $\mathfrak{F}$. It is easy to prove that $P^*(A)$ is then an outer measure in the Carathéodory sense².
In accordance with the Covering Theorem (§ 1), $P^*(A)$ coincides with $P(A)$ for all sets of $\mathfrak{F}$. It can be further shown that all sets of $\mathfrak{F}$ are measurable in the Carathéodory sense. Since all measurable sets form a Borel field, all sets of $B\mathfrak{F}$ are consequently measurable. The set function $P^*(A)$ is, therefore, completely additive on $B\mathfrak{F}$, and on $B\mathfrak{F}$ we may set

$$P(A) = P^*(A).$$

We have thus shown the existence of the extension. The uniqueness of this extension follows immediately from the minimal property of the field $B\mathfrak{F}$.

² Carathéodory, Vorlesungen über reelle Funktionen, pp. 237-258. (New York, Chelsea Publishing Company).

Remark: Even if the sets (events) $A$ of $\mathfrak{F}$ can be interpreted as actual and (perhaps only approximately) observable events, it does not, of course, follow from this that the sets of the extended field $B\mathfrak{F}$ reasonably admit of such an interpretation.

Thus there is the possibility that while a field of probability $(\mathfrak{F}, P)$ may be regarded as the image (idealized, however) of actual random events, the extended field of probability $(B\mathfrak{F}, P)$ will still remain merely a mathematical structure.

Thus sets of $B\mathfrak{F}$ are generally merely ideal events to which nothing corresponds in the outside world. However, if reasoning which utilizes the probabilities of such ideal events leads us to a determination of the probability of an actual event of $\mathfrak{F}$, then, from an empirical point of view also, this determination will automatically be free of contradiction.

§ 3. Examples of Infinite Fields of Probability

I. In § 1 of the first chapter, we have constructed various finite probability fields. Let now $E = \{\xi_1, \xi_2, \ldots, \xi_n, \ldots\}$ be a countable set, and let $\mathfrak{F}$ coincide with the aggregate of the subsets of $E$. All possible probability fields with such an aggregate $\mathfrak{F}$ are obtained in the following manner: We take a sequence of non-negative numbers $p_n$, such that $p_1 + p_2 + \cdots$
$+\, p_n + \cdots = 1$ and for each set $A$ put

$$P(A) = {\sum_n}' p_n,$$

where the summation $\sum'$ extends to all the indices $n$ for which $\xi_n$ belongs to $A$. These fields of probability are obviously Borel fields.

II. In this example, we shall assume that $E$ represents the real number axis. At first, let $\mathfrak{F}$ be formed of all possible finite sums of half-open intervals $[a; b) = \{a \le \xi < b\}$ (taking into consideration not only the proper intervals, with finite $a$ and $b$, but also the improper intervals $[-\infty; a)$, $[a; +\infty)$ and $[-\infty; +\infty)$). $\mathfrak{F}$ is then a field. By means of the extension theorem, however, each field of probability on $\mathfrak{F}$ can be extended to a similar field on $B\mathfrak{F}$. The system of sets $B\mathfrak{F}$ is, therefore, in our case nothing but the system of all Borel point sets on a line. Let us turn now to the following case.

III. Again suppose $E$ to be the real number axis, while $\mathfrak{F}$ is composed of all Borel point sets of this line. In order to construct a field of probability with the given field $\mathfrak{F}$, it is sufficient to define an arbitrary non-negative completely additive set-function $P(A)$ on $\mathfrak{F}$ which satisfies the condition $P(E) = 1$. As is well known³, such a function is uniquely determined by its values

$$P[-\infty; x) = F(x) \tag{1}$$

for the special intervals $[-\infty; x)$. The function $F(x)$ is called the distribution function of $\xi$. Further on (Chap. III, § 2) we shall show that $F(x)$ is non-decreasing, continuous on the left, and has the following limiting values:

$$\lim_{x \to -\infty} F(x) = F(-\infty) = 0, \qquad \lim_{x \to +\infty} F(x) = F(+\infty) = 1. \tag{2}$$

Conversely, if a given function $F(x)$ satisfies these conditions, then it always determines a non-negative completely additive set-function $P(A)$ for which $P(E) = 1$⁴.

IV. Let us now consider the basic set $E$ as an $n$-dimensional Euclidean space $R^n$, i.e., the set of all ordered $n$-tuples $\xi = \{x_1, x_2, \ldots, x_n\}$ of real numbers. Let $\mathfrak{F}$ consist, in this case, of all Borel point-sets⁵ of the space $R^n$.
On the basis of reasoning analogous to that used in Example II, we need not investigate narrower systems of sets, for example the systems of $n$-dimensional intervals.

The role of probability function $P(A)$ will be played here, as always, by any non-negative and completely additive set-function defined on $\mathfrak{F}$ and satisfying the condition $P(E) = 1$. Such a set-function is determined uniquely if we assign its values

$$P(L_{a_1 a_2 \ldots a_n}) = F(a_1, a_2, \ldots, a_n) \tag{3}$$

for the special sets $L_{a_1 a_2 \ldots a_n}$, where $L_{a_1 a_2 \ldots a_n}$ represents the aggregate of all $\xi$ for which $x_i < a_i$ $(i = 1, 2, \ldots, n)$.

For our function $F(a_1, a_2, \ldots, a_n)$ we may choose any function which for each variable is non-decreasing and continuous on the left, and which satisfies the following conditions:

$$\lim_{a_i \to -\infty} F(a_1, a_2, \ldots, a_n) = F(a_1, \ldots, a_{i-1}, -\infty, a_{i+1}, \ldots, a_n) = 0, \qquad i = 1, 2, \ldots, n,$$

$$\lim_{a_1 \to +\infty,\, a_2 \to +\infty,\, \ldots,\, a_n \to +\infty} F(a_1, a_2, \ldots, a_n) = F(+\infty, +\infty, \ldots, +\infty) = 1. \tag{4}$$

$F(a_1, a_2, \ldots, a_n)$ is called the distribution function of the variables $x_1, x_2, \ldots, x_n$.

³ Cf., for example, Lebesgue, Leçons sur l'intégration, 1928, p. 152-156.
⁴ See the previous note.
⁵ For a definition of Borel sets in $R^n$ see Hausdorff, Mengenlehre, 1927, pp. 177-181.

The investigation of fields of probability of the above type is sufficient for all classical problems in the theory of probability⁶. In particular, a probability function in $R^n$ can be defined thus: We take any non-negative point function $f(x_1, x_2, \ldots, x_n)$ defined in $R^n$, such that

$$\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n = 1$$

and set

$$P(A) = \iint \cdots \int_A f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \cdots dx_n. \tag{5}$$

$f(x_1, x_2, \ldots, x_n)$ is, in this case, the probability density at the point $(x_1, x_2, \ldots, x_n)$ (cf. Chap. III, § 2).
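As an illustration of formula (5), the following Python sketch approximates $P(A)$ for a rectangle $A$ in $R^2$ by a midpoint rule. The density (two independent unit exponentials), the function names, and the number of steps are assumptions chosen for this example; they are not taken from the text.

```python
import math


def density(x1, x2):
    """Hypothetical density on R^2: two independent unit exponentials."""
    if x1 < 0 or x2 < 0:
        return 0.0
    return math.exp(-x1 - x2)


def prob_rectangle(f, a1, b1, a2, b2, steps=400):
    """Approximate P(A), the integral of f over A = [a1, b1) x [a2, b2),
    by the midpoint rule on a steps-by-steps grid."""
    h1 = (b1 - a1) / steps
    h2 = (b2 - a2) / steps
    total = 0.0
    for i in range(steps):
        x1 = a1 + (i + 0.5) * h1
        for j in range(steps):
            x2 = a2 + (j + 0.5) * h2
            total += f(x1, x2)
    return total * h1 * h2


# For this particular density the integral over the rectangle is known in closed form.
approx = prob_rectangle(density, 0.0, 1.0, 0.0, 2.0)
exact = (1 - math.exp(-1)) * (1 - math.exp(-2))
print(approx, exact)
```

The two printed numbers should agree to several decimal places; the closed form is available only because the assumed density factorizes.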
Another type of probability function in $R^n$ is obtained in the following manner: Let $\{\xi_i\}$ be a sequence of points of $R^n$, and let $\{p_i\}$ be a sequence of non-negative real numbers, such that $\sum p_i = 1$; we then set, as we did in Example I,

$$P(A) = {\sum}' p_i,$$

where the summation $\sum'$ extends over all indices $i$ for which $\xi_i$ belongs to $A$. The two types of probability functions in $R^n$ mentioned here do not exhaust all possibilities, but are usually considered sufficient for applications of the theory of probability. Nevertheless, we can imagine problems of interest for applications outside of this classical region in which elementary events are defined by means of an infinite number of coordinates. The corresponding fields of probability we shall study more closely after introducing several concepts needed for this purpose. (Cf. Chap. III, § 4).

⁶ Cf., for example, R. v. Mises [1], pp. 13-19. Here the existence of probabilities for "all practically possible" sets of an $n$-dimensional space is required.

Chapter III
RANDOM VARIABLES

§ 1. Probability Functions

Given a mapping of the set $E$ into a set $E'$ consisting of any type of elements, i.e., a single-valued function $u(\xi)$ defined on $E$, whose values belong to $E'$. To each subset $A'$ of $E'$ we shall put into correspondence, as its pre-image in $E$, the set $u^{-1}(A')$ of all elements of $E$ which map onto elements of $A'$. Let $\mathfrak{F}^{(u)}$ be the system of all subsets $A'$ of $E'$, whose pre-images belong to the field $\mathfrak{F}$. $\mathfrak{F}^{(u)}$ will then also be a field. If $\mathfrak{F}$ happens to be a Borel field, the same will be true of $\mathfrak{F}^{(u)}$. We now set

$$P^{(u)}(A') = P\{u^{-1}(A')\}. \tag{1}$$

Since this set-function $P^{(u)}$, defined on $\mathfrak{F}^{(u)}$, satisfies with respect to the field $\mathfrak{F}^{(u)}$ all of our Axioms I - VI, it represents a probability function on $\mathfrak{F}^{(u)}$. Before turning to the proof of all the facts just stated, we shall formulate the following definition.

Definition. Given a single-valued function $u(\xi)$ of a random event $\xi$.
The function $P^{(u)}(A')$, defined by (1), is then called the probability function of $u$.

Remark 1: In studying fields of probability $(\mathfrak{F}, P)$, we call the function $P(A)$ simply the probability function, but $P^{(u)}(A')$ is called the probability function of $u$. In the case $u(\xi) = \xi$, $P^{(u)}(A')$ coincides with $P(A)$.

Remark 2: The event $u^{-1}(A')$ consists of the fact that $u(\xi)$ belongs to $A'$. Therefore, $P^{(u)}(A')$ is the probability of $u(\xi) \subset A'$.

We still have to prove the above-mentioned properties of $\mathfrak{F}^{(u)}$ and $P^{(u)}$. They follow, however, from a single fact, namely:

Lemma. The sum, product, and difference of any pre-image sets $u^{-1}(A')$ are the pre-images of the corresponding sums, products, and differences of the original sets $A'$.

The proof of this lemma is left for the reader.

Let $A'$ and $B'$ be two sets of $\mathfrak{F}^{(u)}$. Their pre-images $A$ and $B$ belong then to $\mathfrak{F}$. Since $\mathfrak{F}$ is a field, the sets $AB$, $A + B$, and $A - B$ also belong to $\mathfrak{F}$; but these sets are the pre-images of the sets $A'B'$, $A' + B'$, and $A' - B'$, which thus belong to $\mathfrak{F}^{(u)}$. This proves that $\mathfrak{F}^{(u)}$ is a field. In the same manner it can be shown that if $\mathfrak{F}$ is a Borel field, so is $\mathfrak{F}^{(u)}$.

Furthermore, it is clear that

$$P^{(u)}(E') = P\{u^{-1}(E')\} = P(E) = 1.$$

That $P^{(u)}$ is always non-negative, is self-evident. It remains only to be shown, therefore, that $P^{(u)}$ is completely additive (cf. the end of § 1, Chap. II).

Let us assume that the sets $A'_n$, and therefore their pre-images $u^{-1}(A'_n)$, are disjoint. It follows that

$$P^{(u)}\Bigl(\sum_n A'_n\Bigr) = P\Bigl\{u^{-1}\Bigl(\sum_n A'_n\Bigr)\Bigr\} = P\Bigl\{\sum_n u^{-1}(A'_n)\Bigr\} = \sum_n P\{u^{-1}(A'_n)\} = \sum_n P^{(u)}(A'_n),$$

which proves the complete additivity of $P^{(u)}$.

In conclusion let us also note the following. Let $u_1(\xi)$ be a function mapping $E$ on $E'$, and $u_2(\xi')$ be another function, mapping $E'$ on $E''$. The product function $u_2 u_1(\xi)$ maps $E$ on $E''$. We shall now study the probability functions $P^{(u_1)}(A')$ and $P^{(u)}(A'')$ for the functions $u_1(\xi)$ and $u(\xi) = u_2 u_1(\xi)$. It is easy to show that these two probability functions are connected by the following relation:

$$P^{(u)}(A'') = P^{(u_1)}\{u_2^{-1}(A'')\}. \tag{2}$$

§ 2.
Definition of Random Variables and of Distribution Functions

Definition. A real single-valued function $x(\xi)$, defined on the basic set $E$, is called a random variable if for each choice of a real number $a$ the set $\{x < a\}$ of all $\xi$ for which the inequality $x < a$ holds true, belongs to the system of sets $\mathfrak{F}$.

This function $x(\xi)$ maps the basic set $E$ into the set $R^1$ of all real numbers. This function determines, as in § 1, a field $\mathfrak{F}^{(x)}$ of subsets of the set $R^1$. We may formulate our definition of random variable in this manner: A real function $x(\xi)$ is a random variable if and only if $\mathfrak{F}^{(x)}$ contains every interval of the form $(-\infty; a)$.

Since $\mathfrak{F}^{(x)}$ is a field, then along with the intervals $(-\infty; a)$ it contains all possible finite sums of half-open intervals $[a; b)$. If our field of probability is a Borel field, then $\mathfrak{F}$ and $\mathfrak{F}^{(x)}$ are Borel fields; therefore, in this case $\mathfrak{F}^{(x)}$ contains all Borel sets of $R^1$.

The probability function of a random variable we shall denote in the future by $P^{(x)}(A')$. It is defined for all sets of the field $\mathfrak{F}^{(x)}$. In particular, for the most important case, the Borel field of probability, $P^{(x)}$ is defined for all Borel sets of $R^1$.

Definition. The function

$$F^{(x)}(a) = P^{(x)}(-\infty; a) = P\{x < a\},$$

where $-\infty$ and $+\infty$ are allowable values of $a$, is called the distribution function of the random variable $x$.

From the definition it follows at once that

$$F^{(x)}(-\infty) = 0, \qquad F^{(x)}(+\infty) = 1. \tag{1}$$

The probability of the realization of both inequalities $a \le x < b$ is obviously given by the formula

$$P\{x \subset [a; b)\} = F^{(x)}(b) - F^{(x)}(a). \tag{2}$$

From this, we have, for $a < b$,

$$F^{(x)}(a) \le F^{(x)}(b),$$

which means that $F^{(x)}(a)$ is a non-decreasing function. Now let $a_1 < a_2 < \cdots < a_n < \cdots \to b$; then

$$\mathfrak{D}_n \{x \subset [a_n; b)\} = 0.$$

Therefore, in accordance with the continuity axiom,

$$F^{(x)}(b) - F^{(x)}(a_n) = P\{x \subset [a_n; b)\}$$

approaches zero as $n \to +\infty$. From this it is clear that $F^{(x)}(a)$ is continuous on the left.
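The properties of the distribution function derived above can be checked on a small discrete example. In the following Python sketch the random variable, its three values, and the helper names are hypothetical. The point is only that the strict inequality in $F(a) = P\{x < a\}$ makes $F$ continuous on the left, and that formula (2) gives the probability of a half-open interval.

```python
from fractions import Fraction

# Hypothetical discrete random variable: value -> probability (the weights sum to 1).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}


def F(a):
    """Distribution function F(a) = P{x < a}, with the strict inequality of the text."""
    return sum(p for value, p in pmf.items() if value < a)


def prob_half_open(a, b):
    """P{a <= x < b}, computed as F(b) - F(a) according to formula (2)."""
    return F(b) - F(a)


# The jump at the value 1 is seen only to the right of the point a = 1.
just_left = F(Fraction(999, 1000))
at_point = F(1)
just_right = F(Fraction(1001, 1000))
print(just_left, at_point, just_right)
print(prob_half_open(1, 3))
```

Here `just_left` and `at_point` coincide, while `just_right` already includes the mass placed at 1; with the convention $P\{x \le a\}$ the function would instead be continuous on the right.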
In an analogous way we can prove the formulae:

$$\lim F^{(x)}(a) = F^{(x)}(-\infty) = 0, \qquad a \to -\infty, \tag{3}$$

$$\lim F^{(x)}(a) = F^{(x)}(+\infty) = 1, \qquad a \to +\infty. \tag{4}$$

If the field of probability $(\mathfrak{F}, P)$ is a Borel field, the values of the probability function $P^{(x)}(A)$ for all Borel sets $A$ of $R^1$ are uniquely determined by knowledge of the distribution function $F^{(x)}(a)$ (cf. § 3, III in Chap. II). Since our main interest lies in these values of $P^{(x)}(A)$, the distribution function plays a most significant role in all our future work.

If the distribution function $F^{(x)}(a)$ is differentiable, then we call its derivative with respect to $a$,

$$f^{(x)}(a) = \frac{d}{da} F^{(x)}(a),$$

the probability density of $x$ at the point $a$.

If also $F^{(x)}(a) = \int_{-\infty}^{a} f^{(x)}(a)\, da$ for each $a$, then we may express the probability function $P^{(x)}(A)$ for each Borel set $A$ in terms of $f^{(x)}(a)$ in the following manner:

$$P^{(x)}(A) = \int_A f^{(x)}(a)\, da. \tag{5}$$

In this case we call the distribution of $x$ continuous. And in the general case, we write, analogously

$$P^{(x)}(A) = \int_A dF^{(x)}(a). \tag{6}$$

All the concepts just introduced are capable of generalization for conditional probabilities. The set function

$$P_B^{(x)}(A) = P_B(x \subset A)$$

is the conditional probability function of $x$ under hypothesis $B$. The non-decreasing function

$$F_B^{(x)}(a) = P_B(x < a)$$

is the corresponding distribution function, and, finally (in the case where $F_B^{(x)}(a)$ is differentiable)

$$f_B^{(x)}(a) = \frac{d}{da} F_B^{(x)}(a)$$

is the conditional probability density of $x$ at the point $a$ under hypothesis $B$.

§ 3. Multi-dimensional Distribution Functions

Let now $n$ random variables $x_1, x_2, \ldots, x_n$ be given. The point $x = (x_1, x_2, \ldots, x_n)$ of the $n$-dimensional space $R^n$ is a function of the elementary event $\xi$. Therefore, according to the general rules in § 1, we have a field $\mathfrak{F}' = \mathfrak{F}^{(x_1, x_2, \ldots, x_n)}$ consisting of subsets of space $R^n$ and a probability function $P^{(x_1, x_2, \ldots, x_n)}(A')$ defined on $\mathfrak{F}'$.
This probability function is called the $n$-dimensional probability function of the random variables $x_1, x_2, \ldots, x_n$.

As follows directly from the definition of a random variable, the field $\mathfrak{F}'$ contains, for each choice of $i$ and $a_i$ $(i = 1, 2, \ldots, n)$, the set of all points in $R^n$ for which $x_i < a_i$. Therefore $\mathfrak{F}'$ also contains the intersection of the above sets, i.e. the set $L_{a_1 a_2 \ldots a_n}$ of all points of $R^n$ for which all the inequalities $x_i < a_i$ hold $(i = 1, 2, \ldots, n)$¹.

¹ The $a_i$ may also assume the infinite values $\pm\infty$.

If we now denote as the $n$-dimensional half-open interval

$$[a_1, a_2, \ldots, a_n;\ b_1, b_2, \ldots, b_n)$$

the set of all points in $R^n$ for which $a_i \le x_i < b_i$, then we see at once that each such interval belongs to the field $\mathfrak{F}'$ since

$$[a_1, a_2, \ldots, a_n;\ b_1, b_2, \ldots, b_n) = L_{b_1 b_2 \ldots b_n} - L_{a_1 b_2 \ldots b_n} - L_{b_1 a_2 b_3 \ldots b_n} - \cdots - L_{b_1 b_2 \ldots b_{n-1} a_n}.$$

The Borel extension of the system of all $n$-dimensional half-open intervals consists of all Borel sets in $R^n$. From this it follows that in the case of a Borel field of probability, the field $\mathfrak{F}'$ contains all the Borel sets in the space $R^n$.

THEOREM: In the case of a Borel field of probability each Borel function $x = f(x_1, x_2, \ldots, x_n)$ of a finite number of random variables $x_1, x_2, \ldots, x_n$ is also a random variable.

All we need to prove this is to point out that the set of all points $(x_1, x_2, \ldots, x_n)$ in $R^n$ for which $x = f(x_1, x_2, \ldots, x_n) < a$, is a Borel set. In particular, all finite sums and products of random variables are also random variables.

Definition: The function

$$F^{(x_1, x_2, \ldots, x_n)}(a_1, a_2, \ldots, a_n) = P^{(x_1, x_2, \ldots, x_n)}(L_{a_1 a_2 \ldots a_n})$$

is called the $n$-dimensional distribution function of the random variables $x_1, x_2, \ldots, x_n$.

As in the one-dimensional case, we prove that the $n$-dimensional distribution function $F^{(x_1, x_2, \ldots, x_n)}(a_1, a_2, \ldots, a_n)$ is non-decreasing and continuous on the left in each variable. In analogy to equations (3) and (4) in § 2, we here have
$$\lim_{a_i \to -\infty} F(a_1, a_2, \ldots, a_n) = F(a_1, \ldots, a_{i-1}, -\infty, a_{i+1}, \ldots, a_n) = 0, \tag{7}$$

$$\lim_{a_1 \to +\infty,\, a_2 \to +\infty,\, \ldots,\, a_n \to +\infty} F(a_1, a_2, \ldots, a_n) = F(+\infty, +\infty, \ldots, +\infty) = 1. \tag{8}$$

The distribution function $F^{(x_1, x_2, \ldots, x_n)}$ gives directly the values of $P^{(x_1, x_2, \ldots, x_n)}$ only for the special sets $L_{a_1 a_2 \ldots a_n}$. If our field, however, is a Borel field, then² $P^{(x_1, x_2, \ldots, x_n)}$ is uniquely determined for all Borel sets in $R^n$ by knowledge of the distribution function $F^{(x_1, x_2, \ldots, x_n)}$.

² Cf. § 3, IV in the Second Chapter.

If there exists the derivative

$$f(a_1, a_2, \ldots, a_n) = \frac{\partial^n}{\partial a_1\, \partial a_2 \cdots \partial a_n} F^{(x_1, x_2, \ldots, x_n)}(a_1, a_2, \ldots, a_n),$$

we call this derivative the $n$-dimensional probability density of the random variables $x_1, x_2, \ldots, x_n$ at the point $a_1, a_2, \ldots, a_n$. If also for every point $(a_1, a_2, \ldots, a_n)$

$$F^{(x_1, x_2, \ldots, x_n)}(a_1, a_2, \ldots, a_n) = \int_{-\infty}^{a_1} \int_{-\infty}^{a_2} \cdots \int_{-\infty}^{a_n} f(a_1, a_2, \ldots, a_n)\, da_1\, da_2 \cdots da_n,$$

then the distribution of $x_1, x_2, \ldots, x_n$ is called continuous. For every Borel set $A \subset R^n$, we have the equality

$$P^{(x_1, x_2, \ldots, x_n)}(A) = \iint \cdots \int_A f(a_1, a_2, \ldots, a_n)\, da_1\, da_2 \cdots da_n. \tag{9}$$

In closing this section we shall make one more remark about the relationships between the various probability functions and distribution functions.

Given the substitution

$$S = \begin{pmatrix} 1, & 2, & \ldots, & n \\ i_1, & i_2, & \ldots, & i_n \end{pmatrix},$$

and let $r_S$ denote the transformation

$$x'_k = x_{i_k} \qquad (k = 1, 2, \ldots, n)$$

of space $R^n$ into itself. It is then obvious that

$$P^{(x_{i_1}, x_{i_2}, \ldots, x_{i_n})}(A) = P^{(x_1, x_2, \ldots, x_n)}\{r_S^{-1}(A)\}. \tag{10}$$

Now let $x' = p_k(x)$ be the "projection" of the space $R^n$ on the space $R^k$ $(k < n)$, so that the point $(x_1, x_2, \ldots, x_n)$ is mapped onto the point $(x_1, x_2, \ldots, x_k)$. Then, as a result of Formula (2) in § 1,

$$P^{(x_1, x_2, \ldots, x_k)}(A) = P^{(x_1, x_2, \ldots, x_n)}\{p_k^{-1}(A)\}. \tag{11}$$

For the corresponding distribution functions, we obtain from (10) and (11) the equations:

$$F^{(x_{i_1}, x_{i_2}, \ldots, x_{i_n})}(a_{i_1}, a_{i_2}, \ldots, a_{i_n}) = F^{(x_1, x_2, \ldots, x_n)}(a_1, a_2, \ldots, a_n), \tag{12}$$

$$F^{(x_1, x_2, \ldots, x_k)}(a_1, a_2, \ldots, a_k) = F^{(x_1, x_2, \ldots, x_n)}(a_1, \ldots, a_k, +\infty, \ldots, +\infty). \tag{13}$$

§ 4.
Probabilities in Infinite-dimensional Spaces

In § 3 of the second chapter we have seen how to construct various fields of probability common in the theory of probability. We can imagine, however, interesting problems in which the elementary events are defined by means of an infinite number of coordinates. Let us take a set $M$ of indices $\mu$ (indexing set) of arbitrary cardinality $\mathfrak{m}$. The totality of all systems

$$\xi = \{x_\mu\}$$

of real numbers $x_\mu$, where $\mu$ runs through the entire set $M$, we shall call the space $R^M$ (in order to define an element $\xi$ in space $R^M$, we must put each element $\mu$ in set $M$ in correspondence with a real number $x_\mu$ or, equivalently, assign a real single-valued function $x_\mu$ of the element $\mu$, defined on $M$)³. If the set $M$ consists of the first $n$ natural numbers $1, 2, \ldots, n$, then $R^M$ is the ordinary $n$-dimensional space $R^n$. If we choose for the set $M$ all real numbers $R^1$, then the corresponding space $R^M = R^{R^1}$ will consist of all real functions

$$\xi(\mu) = x_\mu$$

of the real variable $\mu$.

³ Cf. Hausdorff, Mengenlehre, 1927, p. 23.

We now take the set $R^M$ (with an arbitrary set $M$) as the basic set $E$. Let $\xi = \{x_\mu\}$ be an element in $E$; we shall denote by $p_{\mu_1 \mu_2 \ldots \mu_n}(\xi)$ the point $(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n})$ of the $n$-dimensional space $R^n$. A subset $A$ of $E$ we shall call a cylinder set if it can be represented in the form

$$A = p^{-1}_{\mu_1 \mu_2 \ldots \mu_n}(A'),$$

where $A'$ is a subset of $R^n$. The class of all cylinder sets coincides, therefore, with the class of all sets which can be defined by relations of the form

$$f(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}) = 0. \tag{1}$$

In order to determine an arbitrary cylinder set $p^{-1}_{\mu_1 \mu_2 \ldots \mu_n}(A')$ by such a relation, we need only take as $f$ a function which equals $0$ on $A'$, but outside of $A'$ equals unity.

A cylinder set is a Borel cylinder set if the corresponding set $A'$ is a Borel set. All Borel cylinder sets of the space $R^M$ form a field, which we shall henceforth denote by $\mathfrak{F}^M$⁴.

The Borel extension of the field $\mathfrak{F}^M$ we shall denote, as always, by $B\mathfrak{F}^M$.
Sets in $B\mathfrak{F}^M$ we shall call Borel sets of the space $R^M$.

Later on we shall give a method of constructing and operating with probability functions on $\mathfrak{F}^M$, and consequently, by means of the Extension Theorem, on $B\mathfrak{F}^M$ also. We obtain in this manner fields of probability sufficient for all purposes in the case that the set $M$ is denumerable. We can therefore handle all questions touching upon a denumerable sequence of random variables. But if $M$ is not denumerable, many simple and interesting subsets of $R^M$ remain outside of $B\mathfrak{F}^M$. For example, the set of all elements $\xi$ for which $x_\mu$ remains smaller than a fixed constant for all indices $\mu$, does not belong to the system $B\mathfrak{F}^M$ if the set $M$ is non-denumerable.

It is therefore desirable to try whenever possible to put each problem in such a form that the space of all elementary events $\xi$ has only a denumerable set of coordinates.

⁴ From the above it follows that Borel cylinder sets are the sets definable by relations of type (1) in which $f$ is a Borel function. Now let $A$ and $B$ be two Borel cylinder sets defined by the relations

$$f(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}) = 0, \qquad g(x_{\lambda_1}, x_{\lambda_2}, \ldots, x_{\lambda_m}) = 0.$$

Then we can define the sets $A + B$, $AB$, and $A - B$ respectively by the relations

$$f \cdot g = 0, \qquad f^2 + g^2 = 0, \qquad f^2 + \omega(g) = 0,$$

where $\omega(x) = 0$ for $x \ne 0$ and $\omega(0) = 1$. If $f$ and $g$ are Borel functions, so also are $f \cdot g$, $f^2 + g^2$ and $f^2 + \omega(g)$; therefore, $A + B$, $AB$ and $A - B$ are Borel cylinder sets. Thus we have shown that the system of sets $\mathfrak{F}^M$ is a field.

Let a probability function $P(A)$ be defined on $\mathfrak{F}^M$. We may then regard every coordinate $x_\mu$ of the elementary event $\xi$ as a random variable. In consequence, every finite group $(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n})$ of these coordinates has an $n$-dimensional probability function $P_{\mu_1 \mu_2 \ldots \mu_n}(A)$ and a corresponding distribution function $F_{\mu_1 \mu_2 \ldots \mu_n}(a_1, a_2, \ldots, a_n)$. It is obvious that for every Borel cylinder set

$$A = p^{-1}_{\mu_1 \mu_2 \ldots \mu_n}(A')$$

the following equation holds:

$$P(A) = P_{\mu_1 \mu_2 \ldots \mu_n}(A'),$$

where $A'$ is a Borel set of $R^n$.
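The equation $P(A) = P_{\mu_1 \mu_2 \ldots \mu_n}(A')$ says that the probability of a cylinder set depends only on finitely many coordinates. The following Python sketch illustrates this under strong simplifying assumptions that are not part of the text: every coordinate takes only the values 0 and 1, the coordinates are independent, and each equals 1 with the hypothetical probability 1/3. The same cylinder set is described once through two coordinates and once through three, and both descriptions give the same value.

```python
from fractions import Fraction
from itertools import product

P_ONE = Fraction(1, 3)  # hypothetical probability that any single coordinate equals 1


def finite_dim_prob(indices, base):
    """Finite-dimensional probability of the set `base` (a set of 0/1 tuples)
    for the coordinates listed in `indices`, assuming independent coordinates."""
    total = Fraction(0)
    for point in base:
        assert len(point) == len(indices)
        weight = Fraction(1)
        for bit in point:
            weight *= P_ONE if bit == 1 else 1 - P_ONE
        total += weight
    return total


def lift(base, extra=1):
    """Describe the same cylinder set with `extra` additional, unconstrained coordinates."""
    return {point + tail for point in base for tail in product((0, 1), repeat=extra)}


# Cylinder set {x_2 = 1 and x_5 = 0}, described through the coordinates (2, 5) ...
base = {(1, 0)}
p_small = finite_dim_prob((2, 5), base)
# ... and through the coordinates (2, 5, 7), with x_7 left unconstrained.
p_large = finite_dim_prob((2, 5, 7), lift(base))
print(p_small, p_large)
```

Because the assumed coordinates are identically distributed, the index labels play no role in the computation; they are kept only to mirror the notation $P_{\mu_1 \mu_2 \ldots \mu_n}$.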
In this manner, the probability function $P$ is uniquely determined on the field $\mathfrak{F}^M$ of all Borel cylinder sets by means of the values of all finite probability functions $P_{\mu_1 \mu_2 \ldots \mu_n}$ for all Borel sets of the corresponding spaces $R^n$. However, for Borel sets, the values of the probability functions $P_{\mu_1 \mu_2 \ldots \mu_n}$ are uniquely determined by means of the corresponding distribution functions. We have thus proved the following theorem:

The set of all finite-dimensional distribution functions $F_{\mu_1 \mu_2 \ldots \mu_n}$ uniquely determines the probability function $P(A)$ for all sets in $\mathfrak{F}^M$. If $P(A)$ is defined on $\mathfrak{F}^M$, then (according to the extension theorem) it is uniquely determined on $B\mathfrak{F}^M$ by the values of the distribution functions $F_{\mu_1 \mu_2 \ldots \mu_n}$.

We may now ask the following. Under what conditions does a system of distribution functions $F_{\mu_1 \mu_2 \ldots \mu_n}$ given a priori define a field of probability on $\mathfrak{F}^M$ (and, consequently, on $B\mathfrak{F}^M$)?

We must first note that every distribution function $F_{\mu_1 \mu_2 \ldots \mu_n}$ must satisfy the conditions given in § 3, III of the second chapter; indeed this is contained in the very concept of distribution function. Besides, as a result of formulas (12) and (13) in § 3, we have also the following relations:

$$F_{\mu_{i_1} \mu_{i_2} \ldots \mu_{i_n}}(a_{i_1}, a_{i_2}, \ldots, a_{i_n}) = F_{\mu_1 \mu_2 \ldots \mu_n}(a_1, a_2, \ldots, a_n), \tag{2}$$

$$F_{\mu_1 \mu_2 \ldots \mu_k}(a_1, a_2, \ldots, a_k) = F_{\mu_1 \mu_2 \ldots \mu_n}(a_1, a_2, \ldots, a_k, +\infty, \ldots, +\infty), \tag{3}$$

where $k < n$ and $\begin{pmatrix} 1, & 2, & \ldots, & n \\ i_1, & i_2, & \ldots, & i_n \end{pmatrix}$ is an arbitrary permutation. These necessary conditions prove also to be sufficient, as will appear from the following theorem.

Fundamental Theorem: Every system of distribution functions $F_{\mu_1 \mu_2 \ldots \mu_n}$, satisfying the conditions (2) and (3), defines a probability function $P(A)$ on $\mathfrak{F}^M$, which satisfies Axioms I - VI. This probability function $P(A)$ can be extended (by the extension theorem) to $B\mathfrak{F}^M$ also.

Proof. Given the distribution functions $F_{\mu_1 \mu_2 \ldots \mu_n}$, satisfying the general conditions of Chap.
II, § 3, III and also conditions (2) and (3). Every distribution function $F_{\mu_1 \mu_2 \ldots \mu_n}$ defines uniquely a corresponding probability function $P_{\mu_1 \mu_2 \ldots \mu_n}$ for all Borel sets of $R^n$ (cf. § 3). We shall deal in the future only with Borel sets of $R^n$ and with Borel cylinder sets in $E$.

For every cylinder set

$$A = p^{-1}_{\mu_1 \mu_2 \ldots \mu_n}(A')$$

we set

$$P(A) = P_{\mu_1 \mu_2 \ldots \mu_n}(A'). \tag{4}$$

Since the same cylinder set $A$ can be defined by various sets $A'$, we must first show that formula (4) yields always the same value for $P(A)$.

Let $(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n})$ be a finite system of random variables $x_\mu$. Proceeding from the probability function $P_{\mu_1 \mu_2 \ldots \mu_n}$ of these random variables, we can, in accordance with the rules in § 3, define the probability function $P_{\mu_{i_1} \mu_{i_2} \ldots \mu_{i_k}}$ of each subsystem $(x_{\mu_{i_1}}, x_{\mu_{i_2}}, \ldots, x_{\mu_{i_k}})$. From equations (2) and (3) it follows that this probability function defined according to § 3 is the same as the function $P_{\mu_{i_1} \mu_{i_2} \ldots \mu_{i_k}}$ given a priori. We shall now suppose that the cylinder set $A$ is defined by means of

$$A = p^{-1}_{\mu_{i_1} \mu_{i_2} \ldots \mu_{i_k}}(A')$$

and simultaneously by means of

$$A = p^{-1}_{\mu_{j_1} \mu_{j_2} \ldots \mu_{j_m}}(A''),$$

where all random variables $x_{\mu_i}$ and $x_{\mu_j}$ belong to the system $(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n})$, which is obviously not an essential restriction. The conditions

$$(x_{\mu_{i_1}}, x_{\mu_{i_2}}, \ldots, x_{\mu_{i_k}}) \subset A'$$

and

$$(x_{\mu_{j_1}}, x_{\mu_{j_2}}, \ldots, x_{\mu_{j_m}}) \subset A''$$

are equivalent. Therefore

$$P_{\mu_{i_1} \mu_{i_2} \ldots \mu_{i_k}}(A') = P_{\mu_1 \mu_2 \ldots \mu_n}\{(x_{\mu_{i_1}}, x_{\mu_{i_2}}, \ldots, x_{\mu_{i_k}}) \subset A'\} = P_{\mu_1 \mu_2 \ldots \mu_n}\{(x_{\mu_{j_1}}, x_{\mu_{j_2}}, \ldots, x_{\mu_{j_m}}) \subset A''\} = P_{\mu_{j_1} \mu_{j_2} \ldots \mu_{j_m}}(A''),$$

which proves our statement concerning the uniqueness of the definition of $P(A)$.

Let us now prove that the field of probability $(\mathfrak{F}^M, P)$ satisfies all the Axioms I - VI. Axiom I requires merely that $\mathfrak{F}^M$ be a field. This fact has already been proven above. Moreover, for an arbitrary $\mu$:

$$P(E) = P_\mu(R^1) = 1,$$

which proves that Axioms II and IV apply in this case. Finally, from the definition of $P(A)$ it follows at once that $P(A)$ is non-negative (Axiom III).

It is only slightly more complicated to prove that Axiom V is also satisfied.
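In order to do so, we investigate two cylinder sets

$$A = p^{-1}_{\mu_{i_1} \mu_{i_2} \ldots \mu_{i_k}}(A')$$

and

$$B = p^{-1}_{\mu_{j_1} \mu_{j_2} \ldots \mu_{j_m}}(B').$$

We shall assume that all variables $x_{\mu_i}$ and $x_{\mu_j}$ belong to one inclusive finite system $(x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n})$. If the sets $A$ and $B$ do not intersect, the relations

$$(x_{\mu_{i_1}}, x_{\mu_{i_2}}, \ldots, x_{\mu_{i_k}}) \subset A'$$

and

$$(x_{\mu_{j_1}}, x_{\mu_{j_2}}, \ldots, x_{\mu_{j_m}}) \subset B'$$

are incompatible. Therefore

$$P(A + B) = P_{\mu_1 \mu_2 \ldots \mu_n}\{(x_{\mu_{i_1}}, \ldots, x_{\mu_{i_k}}) \subset A' \text{ or } (x_{\mu_{j_1}}, \ldots, x_{\mu_{j_m}}) \subset B'\}$$

$$= P_{\mu_1 \mu_2 \ldots \mu_n}\{(x_{\mu_{i_1}}, \ldots, x_{\mu_{i_k}}) \subset A'\} + P_{\mu_1 \mu_2 \ldots \mu_n}\{(x_{\mu_{j_1}}, \ldots, x_{\mu_{j_m}}) \subset B'\} = P(A) + P(B),$$

which concludes our proof.

Only Axiom VI remains. Let

$$A_1 \supset A_2 \supset \cdots \supset A_n \supset \cdots$$

be a decreasing sequence of cylinder sets satisfying the condition

$$\lim P(A_n) = L > 0.$$

We shall prove that the product of all sets $A_n$ is not empty. We may assume, without essentially restricting the problem, that in the definition of the first $n$ cylinder sets $A_k$, only the first $n$ coordinates $x_{\mu_k}$ in the sequence

$$x_{\mu_1}, x_{\mu_2}, \ldots, x_{\mu_n}, \ldots$$

occur, i.e.

$$A_n = p^{-1}_{\mu_1 \mu_2 \ldots \mu_n}(B_n).$$

For brevity we set

$$P_{\mu_1 \mu_2 \ldots \mu_n}(B) = P_n(B);$$

then, obviously

$$P_n(B_n) = P(A_n) \ge L > 0.$$

In each set $B_n$ it is possible to find a closed bounded set $U_n$ such that

$$P_n(B_n - U_n) \le \frac{\varepsilon}{2^n}.$$

From this inequality we have for the set

$$V_n = p^{-1}_{\mu_1 \mu_2 \ldots \mu_n}(U_n)$$

the inequality

$$P(A_n - V_n) \le \frac{\varepsilon}{2^n}. \tag{5}$$

Let, moreover,

$$W_n = V_1 V_2 \cdots V_n.$$

From (5) it follows that

$$P(A_n - W_n) \le \varepsilon.$$

Since $W_n \subset V_n \subset A_n$, it follows that

$$P(W_n) \ge P(A_n) - \varepsilon \ge L - \varepsilon.$$

If $\varepsilon$ is sufficiently small, $P(W_n) > 0$ and $W_n$ is not empty. We shall now choose in each set $W_n$ a point $\xi^{(n)}$ with the coordinates $x^{(n)}_\mu$. Every point $\xi^{(n+p)}$, $p = 0, 1, 2, \ldots$, belongs to the set $V_n$; therefore

$$(x^{(n+p)}_{\mu_1}, x^{(n+p)}_{\mu_2}, \ldots, x^{(n+p)}_{\mu_n}) = p_{\mu_1 \mu_2 \ldots \mu_n}(\xi^{(n+p)}) \subset U_n.$$

Since the sets $U_n$ are bounded we may (by the diagonal method) choose from the sequence $\{\xi^{(n)}\}$ a subsequence

$$\xi^{(n_1)}, \xi^{(n_2)}, \ldots, \xi^{(n_i)}, \ldots$$

for which the corresponding coordinates $x^{(n_i)}_{\mu_k}$ tend for any $k$ to a definite limit $x_k$. Let, finally, $\xi$ be a point in set $E$ with the coordinates

$$x_{\mu_k} = x_k, \qquad x_\mu = 0, \quad \mu \ne \mu_k, \quad k = 1, 2, 3, \ldots$$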
As the limit of the sequence $(x^{(n_i)}_{\mu_1}, x^{(n_i)}_{\mu_2}, \ldots, x^{(n_i)}_{\mu_k})$, $i = 1, 2, 3, \ldots$, the point $(x_1, x_2, \ldots, x_k)$ belongs to the set $U_k$. Therefore, $\xi$ belongs to

$$A_k \supset V_k = p^{-1}_{\mu_1 \mu_2 \ldots \mu_k}(U_k)$$

for any $k$ and therefore to the product

$$A = \mathfrak{D}_k A_k.$$

§ 5. Equivalent Random Variables; Various Kinds of Convergence

Starting with this section, we deal exclusively with Borel fields of probability. As we have already explained in § 2 of the second chapter, this does not constitute any essential restriction on our investigations.

Two random variables $x$ and $y$ are called equivalent, if the probability of the relation $x \ne y$ is equal to zero. It is obvious that two equivalent random variables have the same probability function:

$$P^{(x)}(A) = P^{(y)}(A).$$

Therefore, the distribution functions $F^{(x)}$ and $F^{(y)}$ are also identical. In many problems in the theory of probability we may substitute for any random variable any equivalent variable.

Now let

$$x_1, x_2, \ldots, x_n, \ldots \tag{1}$$

be a sequence of random variables. Let us study the set $A$ of all elementary events $\xi$ for which the sequence (1) converges. If we denote by $A^{(m)}_{np}$ the sets of $\xi$ for which all the following inequalities hold

$$|x_{n+k} - x_n| < \frac{1}{m}, \qquad k = 1, 2, \ldots, p,$$

then we obtain at once

$$A = \mathfrak{D}_m\, \mathfrak{S}_n\, \mathfrak{D}_p\, A^{(m)}_{np}. \tag{2}$$

According to § 3, the set $A^{(m)}_{np}$ always belongs to the field $\mathfrak{F}$. The relation (2) shows that $A$, too, belongs to $\mathfrak{F}$. We may, therefore, speak of the probability of convergence of a sequence of random variables, for it always has a perfectly definite meaning.

Now let the probability $P(A)$ of the convergence set $A$ be equal to unity. We may then state that the sequence (1) converges with the probability one to a random variable $x$, where the random variable $x$ is uniquely defined except for equivalence. To determine such a random variable we set

$$x = \lim_{n \to \infty} x_n$$

on $A$, and $x = 0$ outside of $A$.
We have to show that $x$ is a random variable, in other words, that the set $A(a)$ of the elements $\xi$ for which $x < a$, belongs to $\mathfrak{F}$. But

$$A(a) = A\, \mathfrak{S}_n\, \mathfrak{D}_p \{x_{n+p} < a\}$$

in case $a \le 0$, and

$$A(a) = A\, \mathfrak{S}_n\, \mathfrak{D}_p \{x_{n+p} < a\} + \bar{A}$$

in the opposite case, from which our statement follows at once.

If the probability of convergence of the sequence (1) to $x$ equals one, then we say that the sequence (1) converges almost surely to $x$. However, for the theory of probability, another conception of convergence is possibly more important.

Definition. The sequence $x_1, x_2, \ldots, x_n, \ldots$ of random variables converges in probability (converge en probabilité) to the random variable $x$, if for any $\varepsilon > 0$, the probability

$$P\{|x_n - x| > \varepsilon\}$$

tends toward zero as $n \to \infty$⁵.

⁵ This concept is due to Bernoulli; its completely general treatment was introduced by E. E. Slutsky (see [1]).

I. If the sequence (1) converges in probability to $x$ and also to $x'$, then $x$ and $x'$ are equivalent. In fact

$$P\Bigl\{|x - x'| > \frac{1}{m}\Bigr\} \le P\Bigl\{|x_n - x| > \frac{1}{2m}\Bigr\} + P\Bigl\{|x_n - x'| > \frac{1}{2m}\Bigr\};$$

since the last probabilities are as small as we please for a sufficiently large $n$ it follows that

$$P\Bigl\{|x - x'| > \frac{1}{m}\Bigr\} = 0$$

and we obtain at once that

$$P\{x \ne x'\} \le \sum_m P\Bigl\{|x - x'| > \frac{1}{m}\Bigr\} = 0.$$

II. If the sequence (1) almost surely converges to $x$, then it also converges to $x$ in probability. Let $A$ be the convergence set of the sequence (1); then

$$1 = P(A) \le \lim_{n \to \infty} P\{|x_{n+p} - x| < \varepsilon,\ p = 0, 1, 2, \ldots\} \le \lim_{n \to \infty} P\{|x_n - x| < \varepsilon\},$$

from which the convergence in probability follows.

III. For the convergence in probability of the sequence (1) the following condition is both necessary and sufficient: For any $\varepsilon > 0$ there exists an $n$ such that, for every $p > 0$, the following inequality holds:

$$P\{|x_{n+p} - x_n| > \varepsilon\} < \varepsilon.$$

Let $F_1(a), F_2(a), \ldots, F_n(a), \ldots, F(a)$ be the distribution functions of the random variables $x_1, x_2, \ldots, x_n, \ldots, x$. If the sequence $x_n$ converges in probability to $x$, the distribution function $F(a)$ is uniquely determined by knowledge of the functions $F_n(a)$.
We have, in fact,

THEOREM: If the sequence $x_1, x_2, \dots, x_n, \dots$ converges in probability to $x$, the corresponding sequence of distribution functions $F_n(a)$ converges at each point of continuity of $F(a)$ to the distribution function $F(a)$ of $x$.

That $F(a)$ is really determined by the $F_n(a)$ follows from the fact that $F(a)$, being a monotone function, continuous on the left, is uniquely determined by its values at the points of continuity ⁶.

⁶ In fact, it has at most only a countable set of discontinuities (see Lebesgue, Leçons sur l'intégration, 1928, p. 50). Therefore, the points of continuity are everywhere dense, and the value of the function $F(a)$ at a point of discontinuity is determined as the limit of its values at the points of continuity on its left.

To prove the theorem we assume that $F$ is continuous at the point $a$. Let $a' < a$; then in case $x < a'$, $x_n \geq a$ it is necessary that $|x_n - x| > a - a'$. Therefore

$$\lim \mathsf{P}(x < a',\ x_n \geq a) = 0,$$

$$F(a') = \mathsf{P}(x < a') \leq \mathsf{P}(x_n < a) + \mathsf{P}(x < a',\ x_n \geq a) = F_n(a) + \mathsf{P}(x < a',\ x_n \geq a),$$

$$F(a') \leq \liminf F_n(a) + \lim \mathsf{P}(x < a',\ x_n \geq a),$$

$$F(a') \leq \liminf F_n(a). \tag{3}$$

In an analogous manner, we can prove that from $a'' > a$ there follows the relation

$$F(a'') \geq \limsup F_n(a). \tag{4}$$

Since $F(a')$ and $F(a'')$ converge to $F(a)$ for $a' \to a$ and $a'' \to a$, it follows from (3) and (4) that

$$\lim F_n(a) = F(a),$$

which proves our theorem.

Chapter IV
MATHEMATICAL EXPECTATIONS ¹

§ 1. Abstract Lebesgue Integrals

Let $x$ be a random variable and $A$ a set of $\mathfrak{F}$. Let us form, for a positive $\lambda$, the sum

$$S_\lambda = \sum_{k=-\infty}^{+\infty} k\lambda\, \mathsf{P}\{k\lambda \leq x < (k+1)\lambda,\ \xi \subset A\}. \tag{1}$$

If this series converges absolutely for every $\lambda$, then as $\lambda \to 0$, $S_\lambda$ tends toward a definite limit, which is by definition the integral

$$\int_A x\, \mathsf{P}(dE). \tag{2}$$

In this abstract form the concept of an integral was introduced by Fréchet ²; it is indispensable for the theory of probability.
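As a minimal sketch of definitions (1) and (2), the sum $S_\lambda$ can be evaluated for a random variable that takes finitely many values, with $A$ taken to be the whole set $E$. The distribution, the function name `lebesgue_sum`, and the chosen values of $\lambda$ are assumptions made for this illustration only; they do not come from the text.

```python
import math

def lebesgue_sum(values, probs, lam):
    """S_lambda = sum_k k*lam * P{k*lam <= x < (k+1)*lam} for a finite distribution."""
    total = 0.0
    for v, p in zip(values, probs):
        k = math.floor(v / lam)  # index of the cell [k*lam, (k+1)*lam) that contains v
        total += k * lam * p
    return total

# Hypothetical three-point distribution (illustrative values only).
values = [-1.3712, 0.2134, 2.7531]
probs = [0.2, 0.5, 0.3]

exact = sum(v * p for v, p in zip(values, probs))  # ordinary expectation
lams = (1.0, 0.1, 0.01)
approx = [lebesgue_sum(values, probs, lam) for lam in lams]
```

Each value is rounded down to the left end of its cell, so $S_\lambda$ never exceeds the integral and differs from it by less than $\lambda$; this is why the sums settle on a limit as $\lambda \to 0$.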
(The reader will see in the following paragraphs that the usual definition for the conditional mathematical expectation of the variable $x$ under hypothesis $A$ coincides with the definition of the integral (2) except for a constant factor.)

¹ As was stated in § 5 of the third chapter, we are considering in this, as well as in the following chapters, Borel fields of probability only.

² Fréchet, Sur l'intégrale d'une fonctionnelle étendue à un ensemble abstrait, Bull. Soc. Math. France v. 43, 1915, p. 248.

We shall give here a brief survey of the most important properties of the integrals of form (2). The reader will find their proofs in every textbook on real variables, although the proofs are usually carried out only in the case where $\mathsf{P}(A)$ is the Lebesgue measure of sets in $R^n$. The extension of these proofs to the general case does not entail any new mathematical problem; for the most part they remain word for word the same.

I. If a random variable $x$ is integrable on $A$, then it is integrable on each subset $A'$ of $A$ belonging to $\mathfrak{F}$.

II. If $x$ is integrable on $A$ and $A$ is decomposed into no more than a countable number of non-intersecting sets $A_n$ of $\mathfrak{F}$, then

$$\int_A x\, \mathsf{P}(dE) = \sum_n \int_{A_n} x\, \mathsf{P}(dE).$$

III. If $x$ is integrable, $|x|$ is also integrable, and in that case

$$\left|\int_A x\, \mathsf{P}(dE)\right| \leq \int_A |x|\, \mathsf{P}(dE).$$

IV. If in each event $\xi$, the inequalities $0 \leq y \leq x$ hold, then along with $x$, $y$ is also integrable ³, and in that case

$$\int_A y\, \mathsf{P}(dE) \leq \int_A x\, \mathsf{P}(dE).$$

V. If $m \leq x \leq M$ where $m$ and $M$ are two constants, then

$$m\, \mathsf{P}(A) \leq \int_A x\, \mathsf{P}(dE) \leq M\, \mathsf{P}(A).$$

VI. If $x$ and $y$ are integrable, and $K$ and $L$ are two real constants, then $Kx + Ly$ is also integrable, and in this case

$$\int_A (Kx + Ly)\, \mathsf{P}(dE) = K \int_A x\, \mathsf{P}(dE) + L \int_A y\, \mathsf{P}(dE).$$

VII. If the series

$$\sum_n \int_A |x_n|\, \mathsf{P}(dE)$$

converges, then the series

$$\sum_n x_n = x$$

converges at each point of set $A$ with the exception of a certain set $B$ for which $\mathsf{P}(B) = 0$. If we set $x = 0$ everywhere except on $A - B$, then

$$\int_A x\, \mathsf{P}(dE) = \sum_n \int_A x_n\, \mathsf{P}(dE).$$

VIII.
If $x$ and $y$ are equivalent ($\mathsf{P}\{x \neq y\} = 0$), then for every set $A$ of $\mathfrak{F}$

$$\int_A x\, \mathsf{P}(dE) = \int_A y\, \mathsf{P}(dE). \tag{3}$$

³ It is assumed that $y$ is a random variable, i.e., in the terminology of the general theory of integration, measurable with respect to $\mathfrak{F}$.

IX. If (3) holds for every set $A$ of $\mathfrak{F}$, then $x$ and $y$ are equivalent.

From the foregoing definition of an integral we also obtain the following property, which is not found in the usual Lebesgue theory.

X. Let $\mathsf{P}_1(A)$ and $\mathsf{P}_2(A)$ be two probability functions defined on the same field $\mathfrak{F}$, $\mathsf{P}(A) = \mathsf{P}_1(A) + \mathsf{P}_2(A)$, and let $x$ be integrable on $A$ relative to $\mathsf{P}_1(A)$ and $\mathsf{P}_2(A)$. Then

$$\int_A x\, \mathsf{P}(dE) = \int_A x\, \mathsf{P}_1(dE) + \int_A x\, \mathsf{P}_2(dE).$$

XI. Every bounded random variable is integrable.

§ 2. Absolute and Conditional Mathematical Expectations

Let $x$ be a random variable. The integral

$$\mathsf{E}(x) = \int_E x\, \mathsf{P}(dE)$$

is called in the theory of probability the mathematical expectation of the variable $x$. From the properties III, IV, V, VI, VII, VIII, XI, it follows that

I. $|\mathsf{E}(x)| \leq \mathsf{E}(|x|)$;

II. $\mathsf{E}(y) \leq \mathsf{E}(x)$ if $0 \leq y \leq x$ everywhere;

III. $\inf(x) \leq \mathsf{E}(x) \leq \sup(x)$;

IV. $\mathsf{E}(Kx + Ly) = K\,\mathsf{E}(x) + L\,\mathsf{E}(y)$;

V. $\mathsf{E}\left(\sum_n x_n\right) = \sum_n \mathsf{E}(x_n)$, if the series $\sum_n \mathsf{E}(|x_n|)$ converges;

VI. If $x$ and $y$ are equivalent then $\mathsf{E}(x) = \mathsf{E}(y)$.

VII. Every bounded random variable has a mathematical expectation.

From the definition of the integral, we have

$$\mathsf{E}(x) = \lim \sum_{k=-\infty}^{+\infty} km\, \mathsf{P}\{km \leq x < (k+1)m\} = \lim \sum_{k=-\infty}^{+\infty} km\, \{F((k+1)m) - F(km)\}.$$

The second line is nothing more than the usual definition of the Stieltjes integral

$$\int_{-\infty}^{+\infty} a\, dF^{(x)}(a) = \mathsf{E}(x). \tag{1}$$

Formula (1) may therefore serve as a definition of the mathematical expectation $\mathsf{E}(x)$.

Now let $u$ be a function of the elementary event $\xi$, and $x$ be a random variable defined as a single-valued function $x = x(u)$ of $u$. Then

$$\mathsf{P}\{km \leq x < (k+1)m\} = \mathsf{P}^{(u)}\{km \leq x(u) < (k+1)m\},$$

where $\mathsf{P}^{(u)}(A)$ is the probability function of $u$.
It then follows from the definition of the integral that

$$\int_E x\, \mathsf{P}(dE) = \int_{E^{(u)}} x(u)\, \mathsf{P}^{(u)}(dE^{(u)})$$

and, therefore,

$$\mathsf{E}(x) = \int_{E^{(u)}} x(u)\, \mathsf{P}^{(u)}(dE^{(u)}), \tag{2}$$

where $E^{(u)}$ denotes the set of all possible values of $u$. In particular, when $u$ itself is a random variable we have

$$\mathsf{E}(x) = \int_E x\, \mathsf{P}(dE) = \int_{R^1} x(u)\, \mathsf{P}^{(u)}(dR^1) = \int_{-\infty}^{+\infty} x(a)\, dF^{(u)}(a). \tag{3}$$

When $x(u)$ is continuous, the last integral in (3) is the ordinary Stieltjes integral. We must note, however, that the integral

$$\int_{-\infty}^{+\infty} x(a)\, dF^{(u)}(a)$$

can exist even when the mathematical expectation $\mathsf{E}(x)$ does not. For the existence of $\mathsf{E}(x)$, it is necessary and sufficient that the integral

$$\int_{-\infty}^{+\infty} |x(a)|\, dF^{(u)}(a)$$

be finite ⁴.

⁴ Cf. V. Glivenko, Sur les valeurs probables de fonctions, Rend. Accad. Lincei v. 8, 1928, pp. 480-483.

If $u$ is a point $(u_1, u_2, \dots, u_n)$ of the space $R^n$, then as a result of (2):

$$\mathsf{E}(x) = \int\!\!\int \cdots \int_{R^n} x(u_1, u_2, \dots, u_n)\, \mathsf{P}^{(u_1, u_2, \dots, u_n)}(dR^n). \tag{4}$$

We have already seen that the conditional probability $\mathsf{P}_B(A)$ possesses all the properties of a probability function. The corresponding integral

$$\mathsf{E}_B(x) = \int_E x\, \mathsf{P}_B(dE) \tag{5}$$

we call the conditional mathematical expectation of the random variable $x$ with respect to the event $B$. Since

$$\mathsf{P}_B(\bar{B}) = 0, \qquad \int_{\bar{B}} x\, \mathsf{P}_B(dE) = 0,$$

we obtain from (5) the equation

$$\mathsf{E}_B(x) = \int_E x\, \mathsf{P}_B(dE) = \int_B x\, \mathsf{P}_B(dE) + \int_{\bar{B}} x\, \mathsf{P}_B(dE) = \int_B x\, \mathsf{P}_B(dE).$$

We recall that in case $A \subset B$,

$$\mathsf{P}_B(A) = \frac{\mathsf{P}(AB)}{\mathsf{P}(B)} = \frac{\mathsf{P}(A)}{\mathsf{P}(B)};$$

we thus obtain

$$\mathsf{E}_B(x) = \frac{1}{\mathsf{P}(B)} \int_B x\, \mathsf{P}(dE), \tag{6}$$

$$\int_B x\, \mathsf{P}(dE) = \mathsf{P}(B)\, \mathsf{E}_B(x). \tag{7}$$

From (6) and the equality

$$\int_{A+B} x\, \mathsf{P}(dE) = \int_A x\, \mathsf{P}(dE) + \int_B x\, \mathsf{P}(dE)$$

we obtain at last

$$\mathsf{E}_{A+B}(x) = \frac{\mathsf{P}(A)\, \mathsf{E}_A(x) + \mathsf{P}(B)\, \mathsf{E}_B(x)}{\mathsf{P}(A+B)} \tag{8}$$

and, in particular, we have the formula

$$\mathsf{E}(x) = \mathsf{P}(A)\, \mathsf{E}_A(x) + \mathsf{P}(\bar{A})\, \mathsf{E}_{\bar{A}}(x). \tag{9}$$

§ 3. The Tchebycheff Inequality

Let $f(x)$ be a non-negative function of a real argument $x$, which for $x \geq a$ never becomes smaller than $b > 0$. Then for any random variable $x$

$$\mathsf{P}(x \geq a) \leq \frac{\mathsf{E}\{f(x)\}}{b}, \tag{1}$$

provided the mathematical expectation $\mathsf{E}\{f(x)\}$ exists.
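Inequality (1) can be checked on a small example ahead of its proof. The sketch below uses exact rational arithmetic for a hypothetical four-point distribution with $f(x) = x^2$, $a = 2$ and $b = a^2 = 4$, so that $f(x) \geq b$ whenever $x \geq a$. The distribution and the helper name `tail_and_bound` are assumptions introduced for illustration.

```python
from fractions import Fraction

def tail_and_bound(values, probs, f, a, b):
    """Return (P(x >= a), E{f(x)} / b) for a finite distribution."""
    tail = sum(p for v, p in zip(values, probs) if v >= a)
    bound = sum(f(v) * p for v, p in zip(values, probs)) / b
    return tail, bound

# Hypothetical distribution (illustrative values only).
values = [Fraction(-2), Fraction(0), Fraction(1), Fraction(3)]
probs = [Fraction(1, 10), Fraction(4, 10), Fraction(3, 10), Fraction(2, 10)]

a = Fraction(2)
tail, bound = tail_and_bound(values, probs, lambda v: v * v, a, a * a)
# tail is P(x >= 2); bound is E(x^2) / 4; inequality (1) asserts tail <= bound.
```

Here the left side is 1/5 and the right side is 5/8, so the bound holds with room to spare; the inequality is generally far from sharp.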
For,

$$\mathsf{E}\{f(x)\} = \int_E f(x)\, \mathsf{P}(dE) \geq \int_{\{x \geq a\}} f(x)\, \mathsf{P}(dE) \geq b\, \mathsf{P}(x \geq a),$$

from which (1) follows at once.

For example, for every positive $c$,

$$\mathsf{P}(x \geq a) \leq \frac{\mathsf{E}(e^{cx})}{e^{ca}}. \tag{2}$$

Now let $f(x)$ be non-negative, even, and, for positive $x$, non-decreasing. Then for every random variable $x$ and for any choice of the constant $a > 0$ the following inequality holds:

$$\mathsf{P}(|x| \geq a) \leq \frac{\mathsf{E}\{f(x)\}}{f(a)}. \tag{3}$$

In particular,

$$\mathsf{P}(|x - \mathsf{E}(x)| \geq a) \leq \frac{\mathsf{E}\{f(x - \mathsf{E}(x))\}}{f(a)}. \tag{4}$$

Especially important is the case $f(x) = x^2$. We then obtain from (3) and (4)

$$\mathsf{P}(|x| \geq a) \leq \frac{\mathsf{E}(x^2)}{a^2}, \tag{5}$$

$$\mathsf{P}(|x - \mathsf{E}(x)| \geq a) \leq \frac{\mathsf{E}\{x - \mathsf{E}(x)\}^2}{a^2} = \frac{\sigma^2(x)}{a^2}, \tag{6}$$

where

$$\sigma^2(x) = \mathsf{E}\{x - \mathsf{E}(x)\}^2$$

is called the variance of the variable $x$. It is easy to calculate that

$$\sigma^2(x) = \mathsf{E}(x^2) - \{\mathsf{E}(x)\}^2.$$

If $f(x)$ is bounded:

$$|f(x)| \leq K,$$

then a lower bound for $\mathsf{P}(|x| \geq a)$ can be found. For

$$\mathsf{E}(f(x)) = \int_E f(x)\, \mathsf{P}(dE) = \int_{\{|x| < a\}} f(x)\, \mathsf{P}(dE) + \int_{\{|x| \geq a\}} f(x)\, \mathsf{P}(dE) \leq f(a)\, \mathsf{P}(|x| < a) + K\, \mathsf{P}(|x| \geq a) \leq f(a) + K\, \mathsf{P}(|x| \geq a)$$

and therefore

$$\mathsf{P}(|x| \geq a) \geq \frac{\mathsf{E}\{f(x)\} - f(a)}{K}. \tag{7}$$

If instead of $f(x)$ the random variable $x$ itself is bounded,

$$|x| \leq M,$$

then $f(x) \leq f(M)$, and instead of (7), we have the formula

$$\mathsf{P}(|x| \geq a) \geq \frac{\mathsf{E}\{f(x)\} - f(a)}{f(M)}. \tag{8}$$

In the case $f(x) = x^2$, we have from (8)

$$\mathsf{P}(|x| \geq a) \geq \frac{\mathsf{E}(x^2) - a^2}{M^2}. \tag{9}$$

§ 4. Some Criteria for Convergence

Let

$$x_1, x_2, \dots, x_n, \dots \tag{1}$$

be a sequence of random variables and $f(x)$ be a non-negative, even, and for positive $x$ a monotonically increasing function ⁵. Then the following theorems are true:

⁵ Therefore $f(x) > 0$ if $x \neq 0$.

I. In order that the sequence (1) converge in probability the following condition is sufficient: For each $\varepsilon > 0$ there exists an $n$ such that for every $p > 0$, the following inequality holds:

$$\mathsf{E}\{f(x_{n+p} - x_n)\} < \varepsilon. \tag{2}$$

II. In order that the sequence (1) converge in probability to the random variable $x$, the following condition is sufficient:

$$\lim_{n \to +\infty} \mathsf{E}\{f(x_n - x)\} = 0. \tag{3}$$

III. If $f(x)$ is bounded and continuous and $f(0) = 0$, then conditions I and II are also necessary.

IV. If $f(x)$ is continuous, $f(0) = 0$, and the totality of all $x_1, x_2, \dots, x_n, \dots, x$ is bounded, then conditions I and II are also necessary.

From II and IV, we obtain in particular

V. In order that sequence (1) converge in probability to $x$, it is sufficient that

$$\lim \mathsf{E}(x_n - x)^2 = 0. \tag{4}$$

If also the totality of all $x_1, x_2, \dots, x_n, \dots, x$ is bounded, then the condition is also necessary.

For proofs of I - IV see Slutsky [1] and Fréchet [1]. However, these theorems follow almost immediately from formulas (3) and (8) of the preceding section.

§ 5. Differentiation and Integration of Mathematical Expectations with Respect to a Parameter

Let us put each elementary event $\xi$ into correspondence with a definite real function $x(t)$ of a real variable $t$. We say that $x(t)$ is a random function if for every fixed $t$, the variable $x(t)$ is a random variable. The question now arises, under what conditions can the mathematical expectation sign be interchanged with the integration and differentiation signs. The two following theorems, though they do not exhaust the problem, can nevertheless give a satisfactory answer to this question in many simple cases.

Theorem I: If the mathematical expectation $\mathsf{E}[x(t)]$ is finite for any $t$, and $x(t)$ is always differentiable for any $t$, while the derivative $x'(t)$ of $x(t)$ with respect to $t$ is always less in absolute value than some constant $M$, then

$$\frac{d}{dt} \mathsf{E}(x(t)) = \mathsf{E}(x'(t)).$$

Theorem II: If $x(t)$ always remains less, in absolute value, than some constant $K$ and is integrable in the Riemann sense, then

$$\int_a^b \mathsf{E}(x(t))\, dt = \mathsf{E}\left[\int_a^b x(t)\, dt\right],$$

provided $\mathsf{E}[x(t)]$ is integrable in the Riemann sense.

Proof of Theorem I. Let us first note that $x'(t)$, as the limit of the random variables

$$\frac{x(t+h) - x(t)}{h}, \qquad h = 1, \frac{1}{2}, \dots, \frac{1}{n}, \dots,$$

is also a random variable. Since $x'(t)$ is bounded, the mathematical expectation $\mathsf{E}[x'(t)]$ exists (Property VII of mathematical expectation, in § 2).
Let us choose a fixed $t$ and denote by $A$ the event

$$\left|\frac{x(t+h) - x(t)}{h} - x'(t)\right| > \varepsilon.$$

The probability $\mathsf{P}(A)$ tends to zero as $h \to 0$ for every $\varepsilon > 0$. Since

$$\left|\frac{x(t+h) - x(t)}{h}\right| \leq M, \qquad |x'(t)| \leq M$$

holds everywhere, and moreover in the case $\bar{A}$

$$\left|\frac{x(t+h) - x(t)}{h} - x'(t)\right| \leq \varepsilon,$$

then

$$\left|\frac{\mathsf{E}x(t+h) - \mathsf{E}x(t)}{h} - \mathsf{E}x'(t)\right| \leq \mathsf{E}\left|\frac{x(t+h) - x(t)}{h} - x'(t)\right| = \mathsf{P}(A)\, \mathsf{E}_A\left|\frac{x(t+h) - x(t)}{h} - x'(t)\right| + \mathsf{P}(\bar{A})\, \mathsf{E}_{\bar{A}}\left|\frac{x(t+h) - x(t)}{h} - x'(t)\right| \leq 2M\, \mathsf{P}(A) + \varepsilon.$$

We may choose the $\varepsilon > 0$ arbitrarily, and $\mathsf{P}(A)$ is arbitrarily small for any sufficiently small $h$. Therefore

$$\frac{d}{dt} \mathsf{E}x(t) = \lim_{h \to 0} \frac{\mathsf{E}x(t+h) - \mathsf{E}x(t)}{h} = \mathsf{E}x'(t),$$

which was to be proved.

Proof of Theorem II. Let

$$S_n = h \sum_{k=1}^{n} x(a + kh), \qquad h = \frac{b - a}{n}.$$

Since $S_n$ converges to $J = \int_a^b x(t)\, dt$, we can choose for any $\varepsilon > 0$ an $N$ such that from $n \geq N$ there follows the inequality

$$\mathsf{P}(A) = \mathsf{P}\{|S_n - J| > \varepsilon\} < \varepsilon.$$

If we set

$$S_n^* = h \sum_{k=1}^{n} \mathsf{E}x(a + kh) = \mathsf{E}(S_n),$$

then

$$|S_n^* - \mathsf{E}(J)| = |\mathsf{E}(S_n - J)| \leq \mathsf{E}|S_n - J| = \mathsf{P}(A)\, \mathsf{E}_A|S_n - J| + \mathsf{P}(\bar{A})\, \mathsf{E}_{\bar{A}}|S_n - J| \leq 2K\, \mathsf{P}(A) + \varepsilon \leq (2K + 1)\varepsilon.$$

Therefore, $S_n^*$ converges to $\mathsf{E}(J)$, from which results the equation

$$\int_a^b \mathsf{E}x(t)\, dt = \lim S_n^* = \mathsf{E}(J).$$

Theorem II can easily be generalized for double and triple and higher order multiple integrals. We shall give an application of this theorem to one example in geometric probability. Let $G$ be a measurable region of the plane whose shape depends on chance; in other words, let us assign to every elementary event $\xi$ of a field of probability a definite measurable plane region $G$. We shall denote by $J$ the area of the region $G$, and by $\mathsf{P}(x, y)$ the probability that the point $(x, y)$ belongs to the region $G$. Then

$$\mathsf{E}(J) = \int\!\!\int \mathsf{P}(x, y)\, dx\, dy.$$

To prove this it is sufficient to note that

$$J = \int\!\!\int f(x, y)\, dx\, dy, \qquad \mathsf{P}(x, y) = \mathsf{E}f(x, y),$$

where $f(x, y)$ is the characteristic function of the region $G$ ($f(x, y) = 1$ on $G$ and $f(x, y) = 0$ outside of $G$) ⁶.

⁶ Cf. A. Kolmogorov and M. Leontovich, Zur Berechnung der mittleren Brownschen Fläche, Physik. Zeitschr. d. Sovietunion, v. 4, 1933.

Chapter V
CONDITIONAL PROBABILITIES AND MATHEMATICAL EXPECTATIONS

§ 1. Conditional Probabilities

In § 6, Chapter I, we defined the conditional probability, $\mathsf{P}_{\mathfrak{A}}(B)$, of the event $B$ with respect to trial $\mathfrak{A}$. It was there assumed that $\mathfrak{A}$ allows of only a finite number of different possible results. We can, however, define $\mathsf{P}_{\mathfrak{A}}(B)$ also for the case of an $\mathfrak{A}$ with an infinite set of possible results, i.e. the case in which the set $E$ is partitioned into an infinite number of non-intersecting subsets. In particular, we obtain such a partitioning if we consider an arbitrary function $u$ of $\xi$ and define as elements of the partition $\mathfrak{A}_u$ the sets $u = \text{constant}$. The conditional probability $\mathsf{P}_{\mathfrak{A}_u}(B)$ we also denote by $\mathsf{P}_u(B)$. Any partitioning $\mathfrak{A}$ of the set $E$ can be defined as the partitioning $\mathfrak{A}_u$ which is "induced" by a function $u$ of $\xi$, if one assigns to every $\xi$, as $u(\xi)$, that set of the partitioning $\mathfrak{A}$ of $E$ which contains $\xi$.

Two functions $u$ and $u'$ of $\xi$ determine the same partitioning $\mathfrak{A}_u = \mathfrak{A}_{u'}$ of the set $E$ if and only if there exists a one-to-one correspondence $u' = f(u)$ between their domains $\mathfrak{F}^{(u)}$ and $\mathfrak{F}^{(u')}$ such that $u'(\xi)$ is identical with $f u(\xi)$. The reader can easily show that the random variables $\mathsf{P}_u(B)$ and $\mathsf{P}_{u'}(B)$, defined below, are in this case the same. They are thus determined, in fact, by the partition $\mathfrak{A}_u = \mathfrak{A}_{u'}$ itself.

To define $\mathsf{P}_u(B)$ we may use the following equation:

$$\mathsf{P}_{\{u \subset A\}}(B) = \mathsf{E}_{\{u \subset A\}} \mathsf{P}_u(B). \tag{1}$$

It is easy to prove that if the set $E^{(u)}$ of all possible values of $u$ is finite, equation (1) holds true for any choice of $A$ (when $\mathsf{P}_u(B)$ is defined as in § 6, Chap. I). In the general case (in which $\mathsf{P}_u(B)$ is not yet defined) we shall prove that there always exists one and only one random variable $\mathsf{P}_u(B)$ (except for the matter of equivalence) which is defined as a function of $u$ and which satisfies equation (1) for every choice of $A$ from $\mathfrak{F}^{(u)}$ such that $\mathsf{P}^{(u)}(A) > 0$.
The function $\mathsf{P}_u(B)$ of $u$ thus determined to within equivalence, we call the conditional probability of $B$ with respect to $u$ (or, for a given $u$). The value of $\mathsf{P}_u(B)$ when $u = a$ we shall designate by $\mathsf{P}_u(a; B)$.

The proof of the existence and uniqueness of $\mathsf{P}_u(B)$. If we multiply (1) by $\mathsf{P}\{u \subset A\} = \mathsf{P}^{(u)}(A)$, we obtain, on the left,

$$\mathsf{P}\{u \subset A\}\, \mathsf{P}_{\{u \subset A\}}(B) = \mathsf{P}(B\{u \subset A\}) = \mathsf{P}(B\, u^{-1}(A))$$

and, on the right,

$$\mathsf{P}\{u \subset A\}\, \mathsf{E}_{\{u \subset A\}} \mathsf{P}_u(B) = \int_{\{u \subset A\}} \mathsf{P}_u(B)\, \mathsf{P}(dE) = \int_A \mathsf{P}_u(B)\, \mathsf{P}^{(u)}(dE^{(u)}),$$

leading to the formula

$$\mathsf{P}(B\, u^{-1}(A)) = \int_A \mathsf{P}_u(B)\, \mathsf{P}^{(u)}(dE^{(u)}), \tag{2}$$

and conversely (1) follows from (2). In the case $\mathsf{P}^{(u)}(A) = 0$, in which case (1) is meaningless, equation (2) becomes trivially true. Condition (2) is thus equivalent to (1). In accordance with Property IX of the integral (§ 1, Chap. IV) the random variable $x$ is uniquely defined (except for equivalence) by means of the values of the integral

$$\int_A x\, \mathsf{P}(dE)$$

for all sets of $\mathfrak{F}$. Since $\mathsf{P}_u(B)$ is a random variable determined on the probability field $(\mathfrak{F}^{(u)}, \mathsf{P}^{(u)})$, it follows that formula (2) uniquely determines this variable $\mathsf{P}_u(B)$ except for equivalence.

We must still prove the existence of $\mathsf{P}_u(B)$. We shall apply here the following theorem of Nikodym ¹:

Let $\mathfrak{F}$ be a Borel field, $\mathsf{P}(A)$ a non-negative completely additive set function defined on $\mathfrak{F}$ (in the terminology of the probability theory, a random variable on $(\mathfrak{F}, \mathsf{P})$), and let $Q(A)$ be another completely additive set function defined on $\mathfrak{F}$, such that from $Q(A) \neq 0$ follows the inequality $\mathsf{P}(A) > 0$. Then there exists a function $f(\xi)$ (in the terminology of the theory of probability, a random variable) which is measurable with respect to $\mathfrak{F}$, and which satisfies, for each set $A$ of $\mathfrak{F}$, the equation

$$Q(A) = \int_A f(\xi)\, \mathsf{P}(dE).$$

¹ O. Nikodym, Sur une généralisation des intégrales de M. J. Radon, Fund. Math. v. 15, 1930, p. 168 (Theorem III).

In order to apply this theorem to our case, we need to prove 1° that

$$Q(A) = \mathsf{P}(B\, u^{-1}(A))$$

is a completely additive function on $\mathfrak{F}^{(u)}$, and 2° that from $Q(A) \neq 0$ follows the inequality $\mathsf{P}^{(u)}(A) > 0$.

Firstly, 2° follows from

$$0 \leq \mathsf{P}(B\, u^{-1}(A)) \leq \mathsf{P}(u^{-1}(A)) = \mathsf{P}^{(u)}(A).$$

For the proof of 1° we set

$$A = \sum_n A_n;$$

then

$$u^{-1}(A) = \sum_n u^{-1}(A_n)$$

and

$$B\, u^{-1}(A) = \sum_n B\, u^{-1}(A_n).$$

Since $\mathsf{P}$ is completely additive, it follows that

$$\mathsf{P}(B\, u^{-1}(A)) = \sum_n \mathsf{P}(B\, u^{-1}(A_n)),$$

which was to be proved.

From the equation (1) follows an important formula (if we set $A = E^{(u)}$):

$$\mathsf{P}(B) = \mathsf{E}(\mathsf{P}_u(B)). \tag{3}$$

Now we shall prove the following two fundamental properties of conditional probability.

Theorem I. It is almost sure that

$$0 \leq \mathsf{P}_u(B) \leq 1. \tag{4}$$

Theorem II. If $B$ is decomposed into at most a countable number of sets $B_n$:

$$B = \sum_n B_n,$$

then the following equality holds almost surely:

$$\mathsf{P}_u(B) = \sum_n \mathsf{P}_u(B_n). \tag{5}$$

These two properties of $\mathsf{P}_u(B)$ correspond to the two characteristic properties of the probability function $\mathsf{P}(B)$: that $0 \leq \mathsf{P}(B) \leq 1$ always, and that $\mathsf{P}(B)$ is completely additive. These allow us to carry over many other basic properties of the absolute probability $\mathsf{P}(B)$ to the conditional probability $\mathsf{P}_u(B)$. However, we must not forget that $\mathsf{P}_u(B)$ is, for a fixed set $B$, a random variable determined uniquely only to within equivalence.

Proof of Theorem I. If we assume, contrary to the assertion to be proved, that on a set $M \subset E^{(u)}$ with $\mathsf{P}^{(u)}(M) > 0$, the inequality $\mathsf{P}_u(B) \geq 1 + \varepsilon$, $\varepsilon > 0$, holds true, then according to formula (1)

$$\mathsf{P}_{\{u \subset M\}}(B) = \mathsf{E}_{\{u \subset M\}} \mathsf{P}_u(B) \geq 1 + \varepsilon,$$

which is obviously impossible. In the same way we prove that almost surely $\mathsf{P}_u(B) \geq 0$.

Proof of Theorem II. From the convergence of the series

$$\sum_n \mathsf{E}|\mathsf{P}_u(B_n)| = \sum_n \mathsf{E}(\mathsf{P}_u(B_n)) = \sum_n \mathsf{P}(B_n) = \mathsf{P}(B)$$

it follows from Property V of mathematical expectation (Chap. IV, § 2) that the series

$$\sum_n \mathsf{P}_u(B_n)$$

almost surely converges.
Since the series

$$\sum_n \mathsf{E}_{\{u \subset A\}}|\mathsf{P}_u(B_n)| = \sum_n \mathsf{E}_{\{u \subset A\}}(\mathsf{P}_u(B_n)) = \sum_n \mathsf{P}_{\{u \subset A\}}(B_n) = \mathsf{P}_{\{u \subset A\}}(B)$$

converges for every choice of the set $A$ such that $\mathsf{P}^{(u)}(A) > 0$, then from Property V of mathematical expectation just referred to it follows that for each $A$ of the above kind we have the relation

$$\mathsf{E}_{\{u \subset A\}}\left(\sum_n \mathsf{P}_u(B_n)\right) = \sum_n \mathsf{E}_{\{u \subset A\}}(\mathsf{P}_u(B_n)) = \mathsf{P}_{\{u \subset A\}}(B) = \mathsf{E}_{\{u \subset A\}}(\mathsf{P}_u(B)),$$

and from this, equation (5) immediately follows.

To close this section we shall point out two particular cases. If, first, $u(\xi) = c$ (a constant), then $\mathsf{P}_c(A) = \mathsf{P}(A)$ almost surely. If, however, we set $u(\xi) = \xi$, then we obtain at once that $\mathsf{P}_\xi(A)$ is almost surely equal to one on $A$ and is almost surely equal to zero on $\bar{A}$. $\mathsf{P}_\xi(A)$ is thus revealed to be the characteristic function of set $A$.

§ 2. Explanation of a Borel Paradox

Let us choose for our basic set $E$ the set of all points on a spherical surface. Our $\mathfrak{F}$ will be the aggregate of all Borel sets of the spherical surface. And finally, our $\mathsf{P}(A)$ is to be proportional to the measure of set $A$. Let us now choose two diametrically opposite points for our poles, so that each meridian circle will be uniquely defined by the longitude $\psi$, $0 \leq \psi < \pi$. Since $\psi$ varies from $0$ only to $\pi$ (in other words, we are considering complete meridian circles, and not merely semicircles), the latitude $\theta$ must vary from $-\pi$ to $+\pi$ (and not from $-\frac{\pi}{2}$ to $+\frac{\pi}{2}$). Borel set the following problem: Required to determine "the conditional probability distribution" of latitude $\theta$, $-\pi \leq \theta < +\pi$, for a given longitude $\psi$.

It is easy to calculate that

$$\mathsf{P}_\psi\{\theta_1 \leq \theta < \theta_2\} = \frac{1}{4} \int_{\theta_1}^{\theta_2} |\cos\theta|\, d\theta.$$

The probability distribution of $\theta$ for a given $\psi$ is not uniform.

If we assume that the conditional probability distribution of $\theta$ "with the hypothesis that $\xi$ lies on the given meridian circle" must be uniform, then we have arrived at a contradiction.
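The size of the discrepancy can be seen numerically. The sketch below evaluates the conditional probability $\frac{1}{4}\int|\cos\theta|\,d\theta$ by a midpoint rule and compares it with what a uniform distribution on the meridian circle would assign to the same arc. The function names, the number of integration steps, and the two sample arcs are assumptions chosen for illustration.

```python
import math

def meridian_prob(theta1, theta2, steps=20_000):
    """(1/4) * integral of |cos(theta)| over [theta1, theta2], by the midpoint rule."""
    h = (theta2 - theta1) / steps
    total = sum(abs(math.cos(theta1 + (i + 0.5) * h)) for i in range(steps))
    return 0.25 * total * h

def uniform_prob(theta1, theta2):
    """Probability of the same arc if the latitude were uniform on [-pi, pi)."""
    return (theta2 - theta1) / (2 * math.pi)

whole = meridian_prob(-math.pi, math.pi)                  # total mass of the circle
near_equator = meridian_prob(-math.pi / 6, math.pi / 6)   # arc of length pi/3 at the equator
near_pole = meridian_prob(math.pi / 3, 2 * math.pi / 3)   # arc of length pi/3 around a pole
same_arc_uniform = uniform_prob(-math.pi / 6, math.pi / 6)
```

Both arcs have the same length, so a uniform law would give each the probability 1/6; the formula of the text instead gives 1/4 to the equatorial arc and about 0.067 to the polar one, while the total mass over the whole circle is still one.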
This contradiction shows that the concept of a conditional probability with regard to an isolated given hypothesis whose probability equals 0 is inadmissible. For we can obtain a probability distribution for $\theta$ on the meridian circle only if we regard this circle as an element of the decomposition of the entire spherical surface into meridian circles with the given poles.

§ 3. Conditional Probabilities with Respect to a Random Variable

If $x$ is a random variable and $\mathsf{P}_x(B)$ as a function of $x$ is measurable in the Borel sense, then $\mathsf{P}_x(B)$ can be defined in an elementary way. For we can rewrite formula (2) in § 1, to look as follows:

$$\mathsf{P}(B)\, \mathsf{P}_B^{(x)}(A) = \int_A \mathsf{P}_x(B)\, \mathsf{P}^{(x)}(dE). \tag{1}$$

In this case we obtain from (1) at once that

$$\mathsf{P}(B)\, F_B^{(x)}(a) = \int_{-\infty}^{a} \mathsf{P}_x(a; B)\, dF^{(x)}(a). \tag{2}$$

In accordance with a theorem of Lebesgue ² it follows from (2) that

$$\mathsf{P}_x(a; B) = \mathsf{P}(B) \lim_{h \to 0} \frac{F_B^{(x)}(a + h) - F_B^{(x)}(a)}{F^{(x)}(a + h) - F^{(x)}(a)}, \tag{3}$$

which is always true except for a set $H$ of points $a$ for which $\mathsf{P}^{(x)}(H) = 0$.

² Lebesgue, l. c., 1928, pp. 301-302.

$\mathsf{P}_x(a; B)$ was defined in § 1 except on a set $G$, which is such that $\mathsf{P}^{(x)}(G) = 0$. If we now regard formula (3) as the definition of $\mathsf{P}_x(a; B)$ (setting $\mathsf{P}_x(a; B) = 0$ when the limit in the right hand side of (3) fails to exist), then this new variable satisfies all requirements of § 1.

If, besides, the probability densities $f^{(x)}(a)$ and $f_B^{(x)}(a)$ exist and if $f^{(x)}(a) > 0$, then formula (3) becomes

$$\mathsf{P}_x(a; B) = \mathsf{P}(B)\, \frac{f_B^{(x)}(a)}{f^{(x)}(a)}. \tag{4}$$

Moreover, from formula (3) it follows that the existence of a limit in (3) and of a probability density $f^{(x)}(a)$ results in the existence of $f_B^{(x)}(a)$. In that case

$$\mathsf{P}(B)\, f_B^{(x)}(a) \leq f^{(x)}(a). \tag{5}$$

If $\mathsf{P}(B) > 0$, then from (4) we have

$$f_B^{(x)}(a) = \frac{\mathsf{P}_x(a; B)\, f^{(x)}(a)}{\mathsf{P}(B)}. \tag{6}$$

In case $f^{(x)}(a) = 0$, then according to (5) $f_B^{(x)}(a) = 0$ and therefore (6) also holds. If, besides, the distribution of $x$ is continuous, we have

$$\mathsf{P}(B) = \mathsf{E}(\mathsf{P}_x(B)) = \int_{-\infty}^{+\infty} \mathsf{P}_x(a; B)\, dF^{(x)}(a) = \int_{-\infty}^{+\infty} \mathsf{P}_x(a; B)\, f^{(x)}(a)\, da. \tag{7}$$

From (6) and (7) we obtain

$$f_B^{(x)}(a) = \frac{\mathsf{P}_x(a; B)\, f^{(x)}(a)}{\int_{-\infty}^{+\infty} \mathsf{P}_x(a; B)\, f^{(x)}(a)\, da}. \tag{8}$$

This equation gives us the so-called Bayes' Theorem for continuous distributions. The assumptions under which this theorem is proved are these: $\mathsf{P}_x(B)$ is measurable in the Borel sense and at the point $a$ is defined by formula (3), the distribution of $x$ is continuous, and at the point $a$ there exists a probability density $f^{(x)}(a)$.

§ 4. Conditional Mathematical Expectations

Let $u$ be an arbitrary function of $\xi$, and $y$ a random variable. The random variable $\mathsf{E}_u(y)$, representable as a function of $u$ and satisfying, for any set $A$ of $\mathfrak{F}^{(u)}$ with $\mathsf{P}^{(u)}(A) > 0$, the condition

$$\mathsf{E}_{\{u \subset A\}}(y) = \mathsf{E}_{\{u \subset A\}} \mathsf{E}_u(y), \tag{1}$$

is called (if it exists) the conditional mathematical expectation of the variable $y$ for known value of $u$.

If we multiply (1) by $\mathsf{P}^{(u)}(A)$, we obtain

$$\int_{\{u \subset A\}} y\, \mathsf{P}(dE) = \int_A \mathsf{E}_u(y)\, \mathsf{P}^{(u)}(dE^{(u)}). \tag{2}$$

Conversely, from (2) follows formula (1). In case $\mathsf{P}^{(u)}(A) = 0$, in which case (1) is meaningless, (2) becomes trivial. In the same manner as in the case of conditional probability (§ 1) we can prove that $\mathsf{E}_u(y)$ is determined uniquely, except for equivalence, by (2).

The value of $\mathsf{E}_u(y)$ for $u = a$ we shall denote by $\mathsf{E}_u(a; y)$. Let us also note that $\mathsf{E}_u(y)$, as well as $\mathsf{P}_u(B)$, depends only upon the partition $\mathfrak{A}_u$ and may be designated by $\mathsf{E}_{\mathfrak{A}_u}(y)$.

The existence of $\mathsf{E}(y)$ is implied in the definition of $\mathsf{E}_u(y)$ (if we set $A = E^{(u)}$, then $\mathsf{E}_{\{u \subset A\}}(y) = \mathsf{E}(y)$). We shall now prove that the existence of $\mathsf{E}(y)$ is also sufficient for the existence of $\mathsf{E}_u(y)$. For this we only need to prove that by the theorem of Nikodym (§ 1), the set function

$$Q(A) = \int_{\{u \subset A\}} y\, \mathsf{P}(dE)$$

is completely additive on $\mathfrak{F}^{(u)}$ and absolutely continuous with respect to $\mathsf{P}^{(u)}(A)$. The first property is proved verbatim as in the case of conditional probability (§ 1). The second property, absolute continuity, is contained in the fact that from $Q(A) \neq 0$ the inequality $\mathsf{P}^{(u)}(A) > 0$ must follow.
If we assume that $\mathsf{P}^{(u)}(A) = \mathsf{P}\{u \subset A\} = 0$, it is clear that

$$Q(A) = \int_{\{u \subset A\}} y\, \mathsf{P}(dE) = 0$$

and our second requirement is thus fulfilled.

If in equation (1) we set $A = E^{(u)}$, we obtain the formula

$$\mathsf{E}(y) = \mathsf{E}\,\mathsf{E}_u(y). \tag{3}$$

We can show further that almost surely

$$\mathsf{E}_u(ay + bz) = a\,\mathsf{E}_u(y) + b\,\mathsf{E}_u(z), \tag{4}$$

where $a$ and $b$ are two arbitrary constants. (The proof is left to the reader.)

If $u$ and $v$ are two functions of the elementary event $\xi$, then the couple $(u, v)$ can always be regarded as a function of $\xi$. The following important equation then holds:

$$\mathsf{E}_u \mathsf{E}_{(u,v)}(y) = \mathsf{E}_u(y). \tag{5}$$

For, $\mathsf{E}_u(y)$ is defined by the relation

$$\mathsf{E}_{\{u \subset A\}}(y) = \mathsf{E}_{\{u \subset A\}} \mathsf{E}_u(y).$$

Therefore we must show that $\mathsf{E}_u \mathsf{E}_{(u,v)}(y)$ satisfies the equation

$$\mathsf{E}_{\{u \subset A\}}(y) = \mathsf{E}_{\{u \subset A\}} \mathsf{E}_u \mathsf{E}_{(u,v)}(y). \tag{6}$$

From the definition of $\mathsf{E}_{(u,v)}(y)$ it follows that

$$\mathsf{E}_{\{u \subset A\}}(y) = \mathsf{E}_{\{u \subset A\}} \mathsf{E}_{(u,v)}(y). \tag{7}$$

From the definition of $\mathsf{E}_u \mathsf{E}_{(u,v)}(y)$ it follows, moreover, that

$$\mathsf{E}_{\{u \subset A\}} \mathsf{E}_{(u,v)}(y) = \mathsf{E}_{\{u \subset A\}} \mathsf{E}_u \mathsf{E}_{(u,v)}(y). \tag{8}$$

Equation (6) results from equations (7) and (8) and thus proves our statement.

If we set $y$ equal to one on $B$ and to zero outside of $B$, then

$$\mathsf{E}_u(y) = \mathsf{P}_u(B), \qquad \mathsf{E}_{(u,v)}(y) = \mathsf{P}_{(u,v)}(B).$$

In this case, from formula (5) we obtain the formula

$$\mathsf{E}_u \mathsf{P}_{(u,v)}(B) = \mathsf{P}_u(B). \tag{9}$$

The conditional mathematical expectation $\mathsf{E}_u(y)$ may also be defined directly by means of the corresponding conditional probabilities. To do this we consider the following sums:

$$S_\lambda(u) = \sum_{k=-\infty}^{+\infty} k\lambda\, \mathsf{P}_u\{k\lambda \leq y < (k+1)\lambda\} = \sum_k R_k. \tag{10}$$

If $\mathsf{E}(y)$ exists, the series (10) almost certainly* converges. For we have from formula (3) of § 1,

$$\mathsf{E}|R_k| = |k\lambda|\, \mathsf{P}\{k\lambda \leq y < (k+1)\lambda\},$$

and the convergence of the series

$$\sum_{k=-\infty}^{+\infty} |k\lambda|\, \mathsf{P}\{k\lambda \leq y < (k+1)\lambda\} = \sum_k \mathsf{E}|R_k|$$

is the necessary condition for the existence of $\mathsf{E}(y)$ (see Chap. IV, § 1).

* We use almost certainly interchangeably with almost surely.
From this convergence it follows that the series (10) converges almost certainly (see Chap. IV, § 2, V). We can further show, exactly as in the theory of the Lebesgue integral, that from the convergence of (10) for some $\lambda$, its convergence for every $\lambda$ follows, and that in the case where series (10) converges, $S_\lambda(u)$ tends to a definite limit as $\lambda \to 0$ ³. We can then define

$$\mathsf{E}_u(y) = \lim_{\lambda \to 0} S_\lambda(u). \tag{11}$$

³ In this case we consider only a countable sequence of values of $\lambda$; then all probabilities $\mathsf{P}_u\{k\lambda \leq y < (k+1)\lambda\}$ are almost certainly defined for all these values of $\lambda$.

To prove that the conditional expectation $\mathsf{E}_u(y)$ defined by relation (11) satisfies the requirements set forth above, we need only convince ourselves that $\mathsf{E}_u(y)$, as determined by (11), satisfies equation (1). We prove this fact thus:

$$\mathsf{E}_{\{u \subset A\}} \mathsf{E}_u(y) = \lim_{\lambda \to 0} \mathsf{E}_{\{u \subset A\}} S_\lambda(u) = \lim_{\lambda \to 0} \sum_{k=-\infty}^{+\infty} k\lambda\, \mathsf{P}_{\{u \subset A\}}\{k\lambda \leq y < (k+1)\lambda\} = \mathsf{E}_{\{u \subset A\}}(y).$$

The interchange of the mathematical expectation sign with the limit sign is admissible in this computation, since $S_\lambda(u)$ converges uniformly to $\mathsf{E}_u(y)$ as $\lambda \to 0$ (a simple result of Property V of mathematical expectation in § 2). The interchange of the mathematical expectation sign and the summation sign is also admissible since the series

$$\sum_{k=-\infty}^{+\infty} \mathsf{E}_{\{u \subset A\}}\{|k\lambda|\, \mathsf{P}_u[k\lambda \leq y < (k+1)\lambda]\} = \sum_{k=-\infty}^{+\infty} |k\lambda|\, \mathsf{P}_{\{u \subset A\}}[k\lambda \leq y < (k+1)\lambda]$$

converges (an immediate result of Property V of mathematical expectation).

Instead of (11) we may write

$$\mathsf{E}_u(y) = \int_E y\, \mathsf{P}_u(dE). \tag{12}$$

We must not forget here, however, that (12) is not an integral in the sense of § 1, Chap. IV, so that (12) is only a symbolic expression.

If $x$ is a random variable then we call the function of $x$ and $a$

$$F_x^{(y)}(a) = \mathsf{P}_x(y < a)$$

the conditional distribution function of $y$ for known $x$. $F_x^{(y)}(a)$ is almost certainly defined for every $a$. If $a < b$ then almost certainly

$$F_x^{(y)}(a) \leq F_x^{(y)}(b).$$

From (11) and (10) it follows ⁴ that almost certainly

$$\mathsf{E}_x(y) = \lim_{\lambda \to 0} \sum_{k=-\infty}^{+\infty} k\lambda\, [F_x^{(y)}((k+1)\lambda) - F_x^{(y)}(k\lambda)]. \tag{13}$$

⁴ Cf. footnote 3.

This fact can be expressed symbolically by the formula

$$\mathsf{E}_x(y) = \int_{-\infty}^{+\infty} a\, dF_x^{(y)}(a). \tag{14}$$

By means of the new definition of mathematical expectation [(10) and (11)] it is easy to prove that, for a real function of $u$,

$$\mathsf{E}_u[f(u)\, y] = f(u)\, \mathsf{E}_u(y). \tag{15}$$

Chapter VI
INDEPENDENCE; THE LAW OF LARGE NUMBERS

§ 1. Independence

Definition 1: Two functions, $u$ and $v$ of $\xi$, are mutually independent if for any two sets, $A$ of $\mathfrak{F}^{(u)}$, and $B$ of $\mathfrak{F}^{(v)}$, the following equation holds:

$$\mathsf{P}(u \subset A,\ v \subset B) = \mathsf{P}(u \subset A)\, \mathsf{P}(v \subset B) = \mathsf{P}^{(u)}(A)\, \mathsf{P}^{(v)}(B). \tag{1}$$

If the sets $E^{(u)}$ and $E^{(v)}$ consist of only a finite number of elements,

$$E^{(u)} = u_1 + u_2 + \dots + u_n, \qquad E^{(v)} = v_1 + v_2 + \dots + v_m,$$

then our definition of independence of $u$ and $v$ is identical with the definition of independence of the partitions

$$E = \sum_k \{u = u_k\}, \qquad E = \sum_k \{v = v_k\}$$

as in § 5, Chap. I.

For the independence of $u$ and $v$, the following condition is necessary and sufficient. For any choice of set $A$ in $\mathfrak{F}^{(u)}$ the following equation holds almost certainly:

$$\mathsf{P}_v(u \subset A) = \mathsf{P}(u \subset A). \tag{2}$$

In the case $\mathsf{P}^{(v)}(B) = 0$, both equations (1) and (2) are satisfied, and therefore we need only prove their equivalence in the case $\mathsf{P}^{(v)}(B) > 0$. In this case (1) is equivalent to the relation

$$\mathsf{P}_{\{v \subset B\}}(u \subset A) = \mathsf{P}(u \subset A) \tag{3}$$

and therefore to the relation

$$\mathsf{E}_{\{v \subset B\}} \mathsf{P}_v(u \subset A) = \mathsf{P}(u \subset A). \tag{4}$$

On the other hand, it is obvious that equation (4) follows from (2). Conversely, since $\mathsf{P}_v(u \subset A)$ is uniquely determined by (4) to within probability zero, then equation (2) follows from (4) almost certainly.

Definition 2: Let $M$ be a set of functions $u_\mu(\xi)$ of $\xi$. These functions are called mutually independent in their totality if the following condition is satisfied. Let $M'$ and $M''$ be two non-intersecting subsets of $M$, and let $A'$ (or $A''$) be a set from $\mathfrak{F}$ defined by a relation among $u_\mu$ from $M'$ (or $M''$); then we have

$$\mathsf{P}(A'A'') = \mathsf{P}(A')\, \mathsf{P}(A'').$$

The aggregate of all $u_\mu$ of $M'$ (or of $M''$) can be regarded as coordinates of some function $u'$
(or $u''$). Definition 2 requires only the independence of $u'$ and $u''$ in the sense of Definition 1 for each choice of non-intersecting sets $M'$ and $M''$.

If $u_1, u_2, \ldots, u_n$ are mutually independent, then in all cases

$$P\{u_1 \subset A_1,\ u_2 \subset A_2,\ \ldots,\ u_n \subset A_n\} = P(u_1 \subset A_1)\,P(u_2 \subset A_2) \cdots P(u_n \subset A_n), \tag{5}$$

provided the sets $A_k$ belong to the corresponding $\mathfrak{F}^{(u_k)}$ (proved by induction). This equation is not in general, however, at all sufficient for the mutual independence of $u_1, u_2, \ldots, u_n$.

Equation (5) is easily generalized for the case of a countably infinite product.

From the mutual independence of $u_\mu$ in each finite group $(u_{\mu_1}, u_{\mu_2}, \ldots, u_{\mu_k})$ it does not necessarily follow that all $u_\mu$ are mutually independent.

Finally, it is easy to note that the mutual independence of the functions $u_\mu$ is in reality a property of the corresponding partitions $\mathfrak{A}_{u_\mu}$. Further, if $u'_\mu$ are single-valued functions of the corresponding $u_\mu$, then from the mutual independence of $u_\mu$ follows that of $u'_\mu$.

§ 2. Independent Random Variables

If $x_1, x_2, \ldots, x_n$ are mutually independent random variables then from equation (2) of the foregoing paragraph follows, in particular, the formula

$$F^{(x_1, x_2, \ldots, x_n)}(a_1, a_2, \ldots, a_n) = F^{(x_1)}(a_1)\,F^{(x_2)}(a_2) \cdots F^{(x_n)}(a_n). \tag{1}$$

If in this case the field $\mathfrak{F}^{(x_1, x_2, \ldots, x_n)}$ consists only of Borel sets of the space $R^n$, then condition (1) is also sufficient for the mutual independence of the variables $x_1, x_2, \ldots, x_n$.

Proof. Let $x' = (x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ and $x'' = (x_{j_1}, x_{j_2}, \ldots, x_{j_m})$ be two non-intersecting subsystems of the variables $x_1, x_2, \ldots, x_n$. We must show, on the basis of formula (1), that for every two Borel sets $A'$ and $A''$ of $R^k$ (or $R^m$) the following equation holds:

$$P(x' \subset A',\ x'' \subset A'') = P(x' \subset A')\,P(x'' \subset A''). \tag{2}$$

This follows at once from (1) for the sets of the form

$$A' = \{x_{i_1} < a_1,\ x_{i_2} < a_2,\ \ldots,\ x_{i_k} < a_k\}, \qquad A'' = \{x_{j_1} < b_1,\ x_{j_2} < b_2,\ \ldots,\ x_{j_m} < b_m\}.$$

It can be shown that this property of the sets $A'$ and $A''$ is preserved under formation of sums and differences, from which equation (2) follows for all Borel sets.

Now let $x = \{x_\mu\}$ be an arbitrary (in general infinite) aggregate of random variables. If the field $\mathfrak{F}^{(x)}$ coincides with the field $B\mathfrak{F}^M$ ($M$ is the set of all $\mu$), the aggregate of equations

$$F_{\mu_1 \mu_2 \ldots \mu_n}(a_1, a_2, \ldots, a_n) = F_{\mu_1}(a_1)\,F_{\mu_2}(a_2) \cdots F_{\mu_n}(a_n) \tag{3}$$

is necessary and sufficient for the mutual independence of the variables $x_\mu$.

The necessity of this condition follows at once from formula (1). We shall now prove that it is also sufficient. Let $M'$ and $M''$ be two non-intersecting subsets of the set $M$ of all indices $\mu$, and let $A'$ (or $A''$) be a set of $B\mathfrak{F}^M$ defined by a relation among the $x_\mu$ with indices $\mu$ from $M'$ (or $M''$). We must show that we then have

$$P(A'A'') = P(A')\,P(A''). \tag{4}$$

If $A'$ and $A''$ are cylinder sets, then we are dealing with relations among a finite set of variables $x_\mu$; equation (4) represents in that case a simple consequence of previous results (Formula (2)). And since relation (4) holds for sums and differences of sets $A'$ (or $A''$) also, we have proved (4) for all sets of $B\mathfrak{F}^M$ as well.

Now for every $\mu$ of a set $M$ let there be given a priori a distribution function $F_\mu(a)$; in that case we can construct a field of probability such that certain random variables $x_\mu$ in that field ($\mu$ assuming all values in $M$) will be mutually independent, where $x_\mu$ will have for its distribution function the $F_\mu(a)$ given a priori. In order to show this it is enough to take $R^M$ for the basic set $E$ and $B\mathfrak{F}^M$ for the field $\mathfrak{F}$, and to define the distribution functions $F_{\mu_1 \mu_2 \ldots \mu_n}$ (see Chap. III, § 4) by equation (3).

Let us also note that from the mutual independence of each finite group of variables $x_\mu$ (equation (3)) there follows, as we have seen above, the mutual independence of all $x_\mu$ on $B\mathfrak{F}^M$.
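The construction by equation (3) can be illustrated on a finite example. The following is only a sketch under stated assumptions: the marginal distributions `coin` and `biased` and all function names are hypothetical choices made for illustration, the variables take finitely many values, and the distribution function is taken as $F(a) = P(x < a)$. It builds the product measure from given marginals and checks the factorization of the joint distribution function exactly, then shows a dependent pair with the same kind of marginals for which the factorization fails.

```python
from fractions import Fraction
from itertools import product


def product_measure(marginals):
    """Joint distribution that makes the given marginals independent."""
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        values = tuple(v for v, _ in combo)
        prob = Fraction(1)
        for _, p in combo:
            prob *= p
        joint[values] = prob
    return joint


def joint_cdf(joint, bounds):
    """F(a_1, ..., a_n) = P(x_1 < a_1, ..., x_n < a_n)."""
    return sum(p for vals, p in joint.items()
               if all(v < a for v, a in zip(vals, bounds)))


def marginal_cdf(marginal, a):
    """F(a) = P(x < a) for a single variable."""
    return sum(p for v, p in marginal.items() if v < a)


def factorizes(joint, marginals, grid):
    """Check the product formula at every point of a finite grid."""
    for bounds in product(grid, repeat=len(marginals)):
        rhs = Fraction(1)
        for m, a in zip(marginals, bounds):
            rhs *= marginal_cdf(m, a)
        if joint_cdf(joint, bounds) != rhs:
            return False
    return True


# Hypothetical marginals chosen only for illustration.
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
biased = {0: Fraction(1, 3), 1: Fraction(2, 3)}
grid = [0, 1, 2]

independent_joint = product_measure([coin, biased])
# A dependent pair: both coordinates always agree.
dependent_joint = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

independent_ok = factorizes(independent_joint, [coin, biased], grid)
dependent_ok = factorizes(dependent_joint, [coin, coin], grid)
```

Exact rational arithmetic is used so that the equality test is not disturbed by rounding; for variables with finitely many values, checking a grid that separates all attained values is enough to decide the product formula.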
In more inclusive fields of probability this property may be lost.

To conclude this section, we shall give a few more criteria for the independence of two random variables.

If two random variables $x$ and $y$ are mutually independent and if $E(x)$ and $E(y)$ are finite, then almost certainly

$$E_x(y) = E(y), \qquad E_y(x) = E(x). \tag{5}$$

These formulas represent an immediate consequence of the second definition of conditional mathematical expectation (Formulas (10) and (11) of Chap. V, § 4). Therefore, in the case of independence both

$$f^2 = \frac{E[E_x(y) - E(y)]^2}{\sigma^2(y)} \quad \text{and} \quad g^2 = \frac{E[E_y(x) - E(x)]^2}{\sigma^2(x)}$$

are equal to zero (provided $\sigma^2(x) > 0$ and $\sigma^2(y) > 0$). The number $f^2$ is called the correlation ratio of $y$ with respect to $x$, and $g^2$ the same for $x$ with respect to $y$ (Pearson).

From (5) it further follows that

$$E(xy) = E(x)\,E(y). \tag{6}$$

To prove this we apply Formula (15) of § 4, Chap. V:

$$E(xy) = E\,E_x(xy) = E[x\,E_x(y)] = E[x\,E(y)] = E(y)\,E(x).$$

Therefore, in the case of independence

$$r = \frac{E(xy) - E(x)\,E(y)}{\sigma(x)\,\sigma(y)}$$

is also equal to zero; $r$, as is well known, is the correlation coefficient of $x$ and $y$.

If two random variables $x$ and $y$ satisfy equation (6), then they are called uncorrelated. For the sum

$$s = x_1 + x_2 + \cdots + x_n,$$

where the $x_1, x_2, \ldots, x_n$ are uncorrelated in pairs, we can easily compute that

$$\sigma^2(s) = \sigma^2(x_1) + \sigma^2(x_2) + \cdots + \sigma^2(x_n). \tag{7}$$

In particular, equation (7) holds for the independent variables $x_k$.

§ 3. The Law of Large Numbers

Random variables $s_n$ of a sequence

$$s_1, s_2, \ldots, s_n, \ldots$$

are called stable if there exists a numerical sequence

$$d_1, d_2, \ldots, d_n, \ldots$$

such that for any positive $\varepsilon$

$$P\{|s_n - d_n| \ge \varepsilon\}$$

converges to zero as $n \to \infty$. If all $E(s_n)$ exist and if we may set

$$d_n = E(s_n),$$

then the stability is normal.

If all $s_n$ are uniformly bounded, then from

$$P\{|s_n - d_n| \ge \varepsilon\} \to 0, \qquad n \to +\infty \tag{1}$$

we obtain the relation

$$|E(s_n) - d_n| \to 0, \qquad n \to +\infty$$

and therefore $P\{|s_n - E(s_n)| \ge \varepsilon\} \to 0$
as $n \to +\infty$. (2)

The stability of a bounded stable sequence is thus necessarily normal.

Let

$$E(s_n - E(s_n))^2 = \sigma^2(s_n) = \sigma_n^2.$$

According to the Tchebycheff inequality,

$$P\{|s_n - E(s_n)| \ge \varepsilon\} \le \frac{\sigma_n^2}{\varepsilon^2}.$$

Therefore, the Markov condition

$$\sigma_n^2 \to 0, \qquad n \to +\infty \tag{3}$$

is sufficient for normal stability.

If $s_n - E(s_n)$ are uniformly bounded:

$$|s_n - E(s_n)| \le M,$$

then from the inequality (9) in § 3, Chap. IV,

$$P\{|s_n - E(s_n)| \ge \varepsilon\} \ge \frac{\sigma_n^2 - \varepsilon^2}{M^2}.$$

Therefore, in this case the Markov condition (3) is also necessary for the stability of the $s_n$.

If

$$s_n = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

and the variables $x_n$ are uncorrelated in pairs, we have

$$\sigma_n^2 = \frac{1}{n^2}\{\sigma^2(x_1) + \sigma^2(x_2) + \cdots + \sigma^2(x_n)\}.$$

Therefore, in this case, the following condition is sufficient for the normal stability of the arithmetical means $s_n$:

$$n^2 \sigma_n^2 = \sigma^2(x_1) + \sigma^2(x_2) + \cdots + \sigma^2(x_n) = o(n^2) \tag{4}$$

(Theorem of Tchebycheff). In particular, condition (4) is fulfilled if all variables $x_n$ are uniformly bounded.

This theorem can be generalized for the case of weakly correlated variables $x_n$. If we assume that the coefficient of correlation $r_{mn}$¹ of $x_m$ and $x_n$ satisfies the inequality

$$r_{mn} \le c(|n - m|)$$

and that

$$C_n = \sum_{k=0}^{n-1} c(k),$$

then a sufficient condition for normal stability of the arithmetic means $s_n$ is²

$$C_n \sigma_n^2 = o(n^2). \tag{5}$$

¹ It is obvious that $r_{nn} = 1$ always.
² Cf. A. Khintchine, Sur la loi forte des grands nombres. C. R. de l'acad. sci. Paris v. 186, 1928, p. 285.

In the case of independent summands $x_n$ we can state a necessary and sufficient condition for the stability of the arithmetic means $s_n$. For every $x_n$ there exists a constant $m_n$ (the median of $x_n$) which satisfies the following conditions:

$$P(x_n < m_n) \le \tfrac{1}{2}, \qquad P(x_n > m_n) \le \tfrac{1}{2}.$$

We set

$$x_{nk} = x_k \ \text{ if } |x_k - m_k| \le n, \qquad x_{nk} = 0 \ \text{ otherwise},$$

$$s_n^* = \frac{x_{n1} + x_{n2} + \cdots + x_{nn}}{n}.$$

Then the relations

$$\sum_{k=1}^{n} P\{|x_k - m_k| > n\} = \sum_{k=1}^{n} P(x_{nk} \ne x_k) \to 0, \qquad n \to +\infty, \tag{6}$$

$$n^2 \sigma^2(s_n^*) = \sum_{k=1}^{n} \sigma^2(x_{nk}) = o(n^2) \tag{7}$$

are necessary and sufficient for the stability of variables $s_n$³.
We may here assume the constants $d_n$ to be equal to the $E(s_n^*)$, so that in the case where

$$E(s_n^*) - E(s_n) \to 0, \qquad n \to +\infty$$

(and only in this case) the stability is normal.

³ Cf. A. Kolmogorov, Über die Summen durch den Zufall bestimmter unabhängiger Grössen, Math. Ann. v. 99, 1928, pp. 309-319 (corrections and notes to this study, v. 102, 1929, pp. 484-488, Theorem VIII and a supplement on p. 318).

A further generalization of Tchebycheff's theorem is obtained if we assume that the $s_n$ depend in some way upon the results of any $n$ trials,

$$\mathfrak{A}_1, \mathfrak{A}_2, \ldots, \mathfrak{A}_n,$$

so that after each definite outcome of all these $n$ trials $s_n$ assumes a definite value. The general idea of all these theorems, known as the law of large numbers, consists in the fact that if the dependence of variables $s_n$ upon each separate trial $\mathfrak{A}_k$ ($k = 1, 2, \ldots, n$) is very small for a large $n$, then the variables $s_n$ are stable. If we regard

$$\beta_{nk}^2 = E\big[E_{\mathfrak{A}_1 \mathfrak{A}_2 \ldots \mathfrak{A}_k}(s_n) - E_{\mathfrak{A}_1 \mathfrak{A}_2 \ldots \mathfrak{A}_{k-1}}(s_n)\big]^2$$

as a reasonable measure of the dependence of variables $s_n$ upon the trial $\mathfrak{A}_k$, then the above-mentioned general idea of the law of large numbers can be made concrete by the following considerations⁴.

⁴ Cf. A. Kolmogorov, Sur la loi des grands nombres. Rend. Accad. Lincei v. 9, 1929, pp. 470-474.

Let

$$z_{nk} = E_{\mathfrak{A}_1 \mathfrak{A}_2 \ldots \mathfrak{A}_k}(s_n) - E_{\mathfrak{A}_1 \mathfrak{A}_2 \ldots \mathfrak{A}_{k-1}}(s_n).$$

Then

$$s_n - E(s_n) = z_{n1} + z_{n2} + \cdots + z_{nn},$$

$$E(z_{nk}) = E\,E_{\mathfrak{A}_1 \ldots \mathfrak{A}_k}(s_n) - E\,E_{\mathfrak{A}_1 \ldots \mathfrak{A}_{k-1}}(s_n) = E(s_n) - E(s_n) = 0,$$

$$\sigma^2(z_{nk}) = E(z_{nk}^2) = \beta_{nk}^2.$$

We can easily compute also that the random variables $z_{nk}$ ($k = 1, 2, \ldots, n$) are uncorrelated. For let $i < k$; then⁵

$$E_{\mathfrak{A}_1 \ldots \mathfrak{A}_{k-1}}(z_{ni} z_{nk}) = z_{ni}\,E_{\mathfrak{A}_1 \ldots \mathfrak{A}_{k-1}}(z_{nk}) = z_{ni}\big[E_{\mathfrak{A}_1 \ldots \mathfrak{A}_{k-1}}(s_n) - E_{\mathfrak{A}_1 \ldots \mathfrak{A}_{k-1}}(s_n)\big] = 0$$

and therefore

$$E(z_{ni} z_{nk}) = 0.$$
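The decomposition of $s_n - E(s_n)$ into the increments $z_{nk}$ can be verified exactly on a small finite example. The sketch below is illustrative only: the three binary trials, the non-product weights in `_raw`, and the function `s` are all hypothetical choices, and conditional expectations are computed by averaging over outcomes that share the first $k$ trial results. The trials are made dependent on purpose, since the argument above does not use their independence.

```python
from fractions import Fraction
from itertools import product

N_TRIALS = 3
OUTCOMES = list(product((0, 1), repeat=N_TRIALS))

# Hypothetical, deliberately non-product weights, so the trials are dependent.
_raw = {w: Fraction(1 + w[0] + 2 * w[1] * w[2] + w[0] * w[2]) for w in OUTCOMES}
_total = sum(_raw.values())
PROB = {w: _raw[w] / _total for w in OUTCOMES}


def s(w):
    """An arbitrary illustrative function of the three trial outcomes."""
    return w[0] + 2 * w[1] * w[2] - 3 * w[0] * w[2]


def expect(f):
    """Unconditional mathematical expectation of f."""
    return sum(PROB[w] * f(w) for w in OUTCOMES)


def cond_exp(f, k):
    """Return w -> E[f | results of the first k trials]."""
    table = {}
    for prefix in product((0, 1), repeat=k):
        block = [w for w in OUTCOMES if w[:k] == prefix]
        mass = sum(PROB[w] for w in block)
        table[prefix] = sum(PROB[w] * f(w) for w in block) / mass
    return lambda w: table[w[:k]]


LEVELS = [cond_exp(s, k) for k in range(N_TRIALS + 1)]


def z(k):
    """Increment of the conditional expectation of s due to trial k."""
    return lambda w: LEVELS[k](w) - LEVELS[k - 1](w)


mean_s = expect(s)
var_s = expect(lambda w: (s(w) - mean_s) ** 2)
means_z = [expect(z(k)) for k in range(1, N_TRIALS + 1)]
betas = [expect(lambda w, k=k: z(k)(w) ** 2) for k in range(1, N_TRIALS + 1)]
cross = [expect(lambda w, i=i, k=k: z(i)(w) * z(k)(w))
         for i in range(1, N_TRIALS + 1)
         for k in range(i + 1, N_TRIALS + 1)]
```

Here `means_z` holds the expectations of the increments, `cross` the mixed moments for $i < k$, and `betas` the squared dependence measures; with exact fractions the identity between `var_s` and the sum of `betas` can be tested by plain equality.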
We thus have

$$\sigma^2(s_n) = \sigma^2(z_{n1}) + \sigma^2(z_{n2}) + \cdots + \sigma^2(z_{nn}) = \beta_{n1}^2 + \beta_{n2}^2 + \cdots + \beta_{nn}^2.$$

Therefore, the condition

$$\beta_{n1}^2 + \beta_{n2}^2 + \cdots + \beta_{nn}^2 \to 0, \qquad n \to +\infty$$

is sufficient for the normal stability of the variables $s_n$.

⁵ Application of Formula (15) in § 4, Chap. V.

§ 4. Notes on the Concept of Mathematical Expectation

We have defined the mathematical expectation of a random variable $x$ as

$$E(x) = \int_E x \, P(dE) = \int_{-\infty}^{+\infty} a \, dF^{(x)}(a),$$

where the integral on the right is understood as

$$E(x) = \int_{-\infty}^{+\infty} a \, dF^{(x)}(a) = \lim \int_{b}^{c} a \, dF^{(x)}(a), \qquad b \to -\infty,\ c \to +\infty. \tag{1}$$

The idea suggests itself to consider the expression

$$E^*(x) = \lim \int_{-b}^{+b} a \, dF^{(x)}(a), \qquad b \to +\infty \tag{2}$$

as a generalized mathematical expectation. We lose in this case, of course, several simple properties of mathematical expectation. For example, in this case the formula

$$E(x + y) = E(x) + E(y)$$

is not always true. In this form the generalization is hardly admissible. We may add, however, that with some restrictive supplementary conditions, definition (2) becomes entirely natural and useful.

We can discuss the problem as follows. Let

$$x_1, x_2, \ldots, x_n, \ldots$$

be a sequence of mutually independent variables, having the same distribution function $F^{(x)}(a) = F^{(x_n)}(a)$ ($n = 1, 2, \ldots$) as $x$. Let further

$$s_n = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$

We now ask whether there exists a constant $E^*(x)$ such that for every $\varepsilon > 0$

$$\lim P(|s_n - E^*(x)| > \varepsilon) = 0, \qquad n \to +\infty. \tag{3}$$

The answer is: If such a constant $E^*(x)$ exists, it is expressed by Formula (2). The necessary and sufficient condition that Formula (3) hold consists in the existence of limit (2) and the relation

$$P(|x| > n) = o\!\left(\frac{1}{n}\right). \tag{4}$$

To prove this we apply the theorem that condition (4) is necessary and sufficient for the stability of the arithmetic means $s_n$, where, in the case of stability, we may set⁶

$$d_n = \int_{-n}^{+n} a \, dF^{(x)}(a).$$

If there exists a mathematical expectation in the former sense (Formula (1)), then condition (4) is always fulfilled⁷.
Since in this case $E(x) = E^*(x)$, the condition (3) actually does define a generalization of the concept of mathematical expectation. For the generalized mathematical expectation, Properties I - VII (Chap. IV, § 2) still hold; in general, however, the existence of $E^*|x|$ does not follow from the existence of $E^*(x)$.

⁶ Cf. A. Kolmogorov, Bemerkungen zu meiner Arbeit, "Über die Summen zufälliger Grössen", Math. Ann. v. 102, 1929, pp. 484-488, Theorem XII.
⁷ Ibid., Theorem XIII.

To prove that the new concept of mathematical expectation is really more general than the previous one, it is sufficient to give the following example. Set the probability density $f^{(x)}(a)$ equal to

$$f^{(x)}(a) = \frac{C}{(|a| + 2)^2 \ln(|a| + 2)},$$

where the constant $C$ is determined by

$$\int_{-\infty}^{+\infty} f^{(x)}(a) \, da = 1.$$

It is easy to compute that in this case condition (4) is fulfilled. Formula (2) gives the value

$$E^*(x) = 0,$$

but the integral

$$\int_{-\infty}^{+\infty} |a| \, dF^{(x)}(a) = \int_{-\infty}^{+\infty} |a| \, f^{(x)}(a) \, da$$

diverges.

§ 5. Strong Law of Large Numbers; Convergence of Series

The random variables $s_n$ of the sequence

$$s_1, s_2, \ldots, s_n, \ldots$$

are strongly stable if there exists a sequence of numbers

$$d_1, d_2, \ldots, d_n, \ldots$$

such that the random variables

$$s_n - d_n$$

almost certainly tend to zero as $n \to +\infty$. From strong stability follows, obviously, ordinary stability. If we can choose

$$d_n = E(s_n),$$

then the strong stability is normal.

In the Tchebycheff case,

$$s_n = \frac{x_1 + x_2 + \cdots + x_n}{n},$$

where the variables $x_n$ are mutually independent. A sufficient⁸ condition for the normal strong stability of the arithmetic means $s_n$ is the convergence of the series

$$\sum_{n=1}^{\infty} \frac{\sigma^2(x_n)}{n^2}. \tag{1}$$

This condition is the best in the sense that for any series of constants $b_n$ such that

$$\sum_{n=1}^{\infty} \frac{b_n}{n^2} = +\infty$$

we can build a series of mutually independent random variables $x_n$ such that

$$\sigma^2(x_n) = b_n$$

and the corresponding arithmetic means $s_n$ will not be strongly stable.
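A small simulation can make the sufficient condition (1) of this section tangible. This is a rough sketch, not a proof: the uniform distribution on $[-1, 1]$, the seed, the sample size, and the names `running_means`, `tail_sup` and `kolmogorov_series` are all assumptions chosen for illustration. Each summand has expectation zero and variance $1/3$, so the series (1) converges, and the running arithmetic means should settle near zero along the simulated path.

```python
import random


def running_means(draw, n):
    """Arithmetic means s_1, ..., s_n of the values draw(1), ..., draw(n)."""
    total = 0.0
    means = []
    for k in range(1, n + 1):
        total += draw(k)
        means.append(total / k)
    return means


N = 20000
rng = random.Random(12345)  # fixed seed so that the run is reproducible

# Hypothetical choice: x_k uniform on [-1, 1], so E(x_k) = 0 and sigma^2(x_k) = 1/3.
means = running_means(lambda k: rng.uniform(-1.0, 1.0), N)

# Largest deviation of s_n from E(s_n) = 0 over the tail n >= 1000.
tail_sup = max(abs(m) for m in means[999:])

# Partial sum of the series (1); it stays below (1/3) * pi^2 / 6.
kolmogorov_series = sum((1.0 / 3.0) / k ** 2 for k in range(1, N + 1))
```

Looking at the supremum over a tail of the path, rather than at a single $s_n$, is what distinguishes the strong statement from ordinary stability; a single finite path can of course only suggest, not establish, almost certain convergence.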
If all $x_n$ have the same distribution function $F^{(x)}(a)$, then the existence of the mathematical expectation

$$E(x) = \int_{-\infty}^{+\infty} a \, dF^{(x)}(a)$$

is necessary and sufficient for the strong stability of $s_n$; the stability in this case is always normal⁹.

Again, let

$$x_1, x_2, \ldots, x_n, \ldots$$

be mutually independent random variables. Then the probability of convergence of the series

$$\sum_{n=1}^{\infty} x_n \tag{2}$$

is equal either to one or to zero. In particular, this probability equals one when both series

$$\sum_{n=1}^{\infty} E(x_n) \quad \text{and} \quad \sum_{n=1}^{\infty} \sigma^2(x_n)$$

converge. Let us further assume

$$y_n = x_n \ \text{ in case } |x_n| \le 1, \qquad y_n = 0 \ \text{ in case } |x_n| > 1.$$

Then in order that series (2) converge with the probability one, it is necessary and sufficient¹⁰ that the following series converge simultaneously:

$$\sum_{n=1}^{\infty} P\{|x_n| > 1\}, \qquad \sum_{n=1}^{\infty} E(y_n) \quad \text{and} \quad \sum_{n=1}^{\infty} \sigma^2(y_n).$$

⁸ Cf. A. Kolmogorov, Sur la loi forte des grands nombres, C. R. Acad. Sci. Paris v. 191, 1930, pp. 910-911.
⁹ The proof of this statement has not yet been published.
¹⁰ Cf. A. Khintchine and A. Kolmogorov, On the Convergence of Series, Rec. Math. Soc. Moscow, v. 32, 1925, pp. 668-677.

Appendix

ZERO-OR-ONE LAW IN THE THEORY OF PROBABILITY

We have noticed several cases in which certain limiting probabilities are necessarily equal to zero or one. For example, the probability of convergence of a series of independent random variables may assume only these two values¹. We shall prove now a general theorem including many such cases.

Theorem: Let $x_1, x_2, \ldots, x_n, \ldots$ be any random variables and let $f(x_1, x_2, \ldots, x_n, \ldots)$ be a Baire function² of the variables $x_1, x_2, \ldots, x_n, \ldots$ such that the conditional probability

$$P_{x_1, x_2, \ldots, x_n}\{f(x) = 0\}$$

of the relation

$$f(x_1, x_2, \ldots, x_n, \ldots) = 0$$

remains, when the first $n$ variables $x_1, x_2, \ldots, x_n$ are known, equal to the absolute probability

$$P\{f(x) = 0\} \tag{1}$$

for every $n$. Under these conditions the probability (1) equals zero or one.
In particular, the assumptions of this theorem are fulfilled if the variables $x_n$ are mutually independent and if the value of the function $f(x)$ remains unchanged when only a finite number of variables are changed.

Proof of the Theorem: Let us denote by $A$ the event

$$f(x) = 0.$$

We shall also investigate the field $\mathfrak{K}$ of all events which can be defined through some relations among a finite number of variables $x_n$. If event $B$ belongs to $\mathfrak{K}$, then, according to the conditions of the theorem,

$$P_B(A) = P(A). \tag{2}$$

In the case $P(A) = 0$ our theorem is already true. Let now $P(A) > 0$. Then from (2) follows the formula

$$P_A(B) = \frac{P_B(A)\,P(B)}{P(A)} = P(B), \tag{3}$$

and therefore $P(B)$ and $P_A(B)$ are two completely additive set functions, coinciding on $\mathfrak{K}$; therefore they must remain equal to each other on every set of the Borel extension $B\mathfrak{K}$ of the field $\mathfrak{K}$. Therefore, in particular,

$$P(A) = P_A(A) = 1,$$

which proves our theorem.

Several other cases in which we can state that certain probabilities can assume only the values one and zero were discovered by P. Lévy. See P. Lévy, Sur un théorème de M. Khintchine, Bull. des Sci. Math. v. 55, 1931, pp. 145-160, Theorem II.

¹ Cf. Chap. VI, § 5. The same thing is true of the probability $P\{s_n - d_n \to 0\}$ in the strong law of large numbers; at least, when the variables $x_n$ are mutually independent.
² A Baire function is one which can be obtained by successive passages to the limit, of sequences of functions, starting with polynomials.

BIBLIOGRAPHY

[1] Bernstein, S.: On the axiomatic foundation of the theory of probability. (In Russian). Mitt. Math. Ges. Charkov, 1917, pp. 209-274.
[2] Bernstein, S.: Theory of probability, 2nd edition. (In Russian). Moscow, 1927. Government publication RSFSR.
[1] Borel, E.: Les probabilités dénombrables et leurs applications arithmétiques. Rend. Circ. mat. Palermo Vol. 27 (1909) pp. 247-271.
[2] Borel, E.: Principes et formules classiques, fasc. 1 du tome I du Traité des probabilités par E. Borel et divers auteurs.
Paris: Gauthier-Villars 1925.
[3] Borel, E.: Applications à l'arithmétique et à la théorie des fonctions, fasc. 1 du tome II du Traité des probabilités par E. Borel et divers auteurs. Paris: Gauthier-Villars 1926.
[1] Cantelli, F. P.: Una teoria astratta del Calcolo delle probabilità. Giorn. Ist. Ital. Attuari Vol. 3 (1932) pp. 257-265.
[2] Cantelli, F. P.: Sulla legge dei grandi numeri. Mem. Acad. Lincei Vol. 11 (1916).
[3] Cantelli, F. P.: Sulla probabilità come limite della frequenza. Rend. Accad. Lincei Vol. 26 (1917) pp. 39-45.
[1] Copeland, H.: The theory of probability from the point of view of admissible numbers. Ann. Math. Statist. Vol. 3 (1932) pp. 143-156.
[1] Dörge, K.: Zu der von R. von Mises gegebenen Begründung der Wahrscheinlichkeitsrechnung. Math. Z. Vol. 32 (1930) pp. 232-258.
[1] Fréchet, M.: Sur la convergence en probabilité. Metron Vol. 8 (1930) pp. 1-48.
[2] Fréchet, M.: Recherches théoriques modernes, fasc. 3 du tome I du Traité des probabilités par E. Borel et divers auteurs. Paris: Gauthier-Villars.
[1] Kolmogorov, A.: Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Math. Ann. Vol. 104 (1931) pp. 415-458.
[2] Kolmogorov, A.: The general theory of measure and the theory of probability. (In Russian). Sbornik trudow sektii totshnych nauk K. A., Vol. 1 (1929) pp. 8-21.
[1] Lévy, P.: Calcul des probabilités. Paris: Gauthier-Villars.
[1] Lomnicki, A.: Nouveaux fondements du calcul des probabilités. Fundam. Math. Vol. 4 (1923) pp. 34-71.
[1] Mises, R. v.: Wahrscheinlichkeitsrechnung. Leipzig u. Wien: Fr. Deuticke 1931.
[2] Mises, R. v.: Grundlagen der Wahrscheinlichkeitsrechnung. Math. Z. Vol. 5 (1919) pp. 52-99.
[3] Mises, R. v.: Wahrscheinlichkeitsrechnung, Statistik und Wahrheit. Wien: Julius Springer 1928.
[3'] Mises, R. v.: Probability, Statistics and Truth (translation of above). New York: The MacMillan Company 1939.
[1] Reichenbach, H.: Axiomatik der Wahrscheinlichkeitsrechnung. Math. Z. Vol. 34 (1932) pp. 568-619.
[1] Slutsky, E.: Über stochastische Asymptoten und Grenzwerte. Metron Vol.
5 (1925) pp. 3-89.
[2] Slutsky, E.: On the question of the logical foundation of the theory of probability. (In Russian). Westnik Statistiki, Vol. 12 (1922), pp. 13-21.
[1] Steinhaus, H.: Les probabilités dénombrables et leur rapport à la théorie de la mesure. Fundam. Math. Vol. 4 (1923) pp. 286-310.
[1] Tornier, E.: Wahrscheinlichkeitsrechnung und Zahlentheorie. J. reine angew. Math. Vol. 160 (1929) pp. 177-198.
[2] Tornier, E.: Grundlagen der Wahrscheinlichkeitsrechnung. Acta math. Vol. 60 (1933) pp. 239-380.